XML vs JSON
A funny thing happened on the way to XML’s world domination of the dissemination of written, document-oriented content: the data exchange world hijacked XML’s value and kept it for many years. Now JSON has the attention of web developers for data transactions – is XML in the way?
Getting Our Definitions Straight
For data (as used here ‘data’ refers to relational, or otherwise highly structured, discreet information such as financial data), XML and JSON are two sides of the same data description coin: either can be called and the game will be played. JSON works best for web-only developers, but learning XML isn’t too hard and the supporting resources are widely available with many available free and open-source.
For documents (as used here ‘documents’ means a mix of authored prose, multimedia, and data meant for presentation to a content consumer), XML is still the dominate open-standard format for semantically-rich content automation applications such as Quark Enterprise Solutions and modern word processing tools such as the Microsoft Office suite – though the purpose, use, and value of XML is significantly different between these document-focused solutions.
History Lessons for Your Markup Language of the Day
XML became an official W3C recommendation in February of 1998. At my previous company, two team members worked on the XML standard for several years alongside a who’s who of document and hyper-text technologists. The whole idea of XML, as driven by Jon Bosak, then at Sun Microsystems, was to take the benefits of SGML (Standard Generalized Markup Languages) and apply them to this new thing called “The World Wide Web.”
I remember how excited we all were when the spec was finally approved. So much attention was now being paid to our corner of the high-tech universe and the idea of having semantic XML content on the web was, to us at least, so clearly valuable. But then, the data jocks overwhelmed us document kids like a high school basketball team coming on the court after the band warms up the crowd.
EDI (electronic data interchange) methods have been around since the early days of computing. By the time XML became a recommendation, the data world was already building a new EDI method that took advantage of the web’s HTTP for the transport of messages and data payloads with the data package built using XML syntax. This EDI method was called SOAP (simple object access protocol) and when released by Microsoft and others in 1999, it very quickly became the main hype of XML’s value. All of us document folks were left playing the sad trombone sound while we continued our efforts to make semantically rich content’s value accessible and available to all (and still do today!).
Of course, all was not perfect for XML as an EDI solution. XML is a fairly verbose markup language and therefore the XML data payload can be multiple times larger than the data set it’s describing. And XML requires a robust parser, which has its own rules that were originally targeting document requirements, not the needs of more compact data structures. And lastly, many browsers were slow to adopt XML as a web standard.
It’s not XML vs. JSON, It’s Selecting the Right Tool for the Job
An oversimplification to the answer of “What is the Right Tool” is something like this:
- XML for documents (written content)
- JSON for data (transactions over the web)
Of course there are still many systems that offer SOAP APIs. Further still, the more modern REST (representational state transfer) web API doesn’t really care about the payload format, so many systems may provide both XML and JSON responses (as does Quark Publishing Platform – developer’s choice). But there are definitely gray areas when trying to determine if XML or JSON is the best fit.
Several standards exist that are used for transacting files and metadata between parties including:
- RIXML (Research Information Exchange Markup Language) used in Financial Services Investment Research publishing
- eCTD (electronic common technical document) used in pharmaceuticals for transmitting drug research to the FDA (US Food and Drug Administration)
- And other, more general metadata standards such as Dublin Core, XMP, and more
What these standards share is the use of XML to describe a package of documents in a way that lets the receiver of the package automate the handling of that package. For RIXML and eCTD, the payload mostly consists of PDF documents. The XML is used to hold the metadata that describes the package (producer, purpose, date, a description of each attached file, etc.). For the metadata “driver” or “backbone” file, XML made sense for many reasons, not the least of which was the contributors developing these standards were XML-knowledgeable folks and the tools and methodologies for creating these standards as XML were widely available.
Of course 27 characters isn’t particularly meaningful, but multiply the size of those messages by 10, 100, or 1000 and the size difference becomes meaningful. Yegor concludes that JSON is great for data sent to dynamic web pages, but he recommends XML for all other purposes.
However, his arguments against JSON were already being addressed (as he admits toward the end of the article) as the JSON world brought more tools to the party such as JSONPath and JSON rules files with validating JSON parsers. JSON features are now reasonably on par with XML, though of course still focused on solving the challenges of transacting data.
A Little More about RIXML – A Good Test Case
If you are technically minded and curious, it might be worth reviewing the RIXML Data Dictionary and jump to page 21 where the data dictionary begins in earnest. It takes a little over 100 pages to document the entire data semantics structure of the main areas of concern (not including the “sidecars” as RIXML calls them). This results in a metadata file describing the payload for a transaction of what is typically one or more PDF documents.
There is no reason why that structure couldn’t be represented as JSON, but there’s also not a particularly good reason to do so either. Ultimately what matters is which system receives the RIXML and document payload. In the case of most RIXML processing systems it is likely a backend server using Java or .NET code to parse the RIXML file and then update a database and file system according to agreed-upon business rules.
For example, take a distributor of financial research information that is produced and sent to the distributor by multiple different banks (the reason RIXML was created to begin with!). They receive the RIXML package, process it, store the information in their database or content management system and then present some portion of that information on a web page for subscribers to access. They don’t present the entire RIXML metadata – most of that would be useless to the research consumer. And a RIXML package isn’t really dynamic either – for a particular package, the metadata doesn’t change very frequently, if ever.
The distributor’s system isn’t going to rely on creating a subset of the original RIXML file to send to the browser. No, they’re going to query the system, because it is their single source of truth for content that is available. Delivering query results from the system as JSON to the browser is easier than creating or re-parsing or modifying the original RIXML file.
So an argument could be made to support both XML and JSON in RIXML (and by extension other metadata standards). Unfortunately for the JSON-only audience, the expense to recast those 100 pages of specifications for XML as JSON is non-zero, and a one-time conversion of the XML to JSON is not developer friendly. And for all of this additional effort the benefit would only apply to those that have yet to learn XML.
Long Story Bygones
There is and will always be waves of new technology that provide an alternative to, overlap with, completely replace, or partially supplant an existing technology. At the time of XML’s development, its use for transacting data was secondary to its original purpose and shows just how hungry the data world was for a better defined standard for transactions. That a more purpose-built, data-friendly format, JSON, was created at close to the same time also highlights how much need there was for improvement and standardization in data transactions.
However, XML is still a fantastic technology for handling documents, metadata, and data, and especially adept at merging all three into a common structure that can be utilized by software for automation and by humans for authoring and consuming. If you are not exclusively processing discreet data transactions, there is a lot of benefit to understanding and utilizing XML and the rich toolsets that are available.
If you are purely a web-data jockey, it would still benefit you to learn XML and the associated tools because: a) you’re likely to run into a system that provides only XML; and b) having some XML skills would extend your opportunities to cool things that can be done in the document content domain.