Striking a Balance: Simplicity vs. Richness in XML Architecture
Our devices are getting smarter, and developers are finding interesting ways to improve the effectiveness of applications that make our lives easier. For example, we can talk to Siri, Cortana, Alexa, or a range of other personal assistant apps, speaking a request or asking a variety of questions, and receive useful or perhaps intentionally funny responses. We also receive unintentionally funny responses or responses that are not what we are looking for, such as when Siri responded to a request for help with a gambling problem by providing a list of casinos. We are, after all, in the early stages of artificial intelligence. Currently, much of what our computers know still has to be defined by us humans.
We provide much of that intelligence through markup. The richer our markup, the more control we have over our content. For example, beyond using markup to distinguish paragraphs, lists, and other document constructs, we can define multiple types of lists that contain different categories of information or need specific formatting (author lists, glossary lists, and so on). With that understanding, schemas have bloomed. DITA is the best-known example; the latest DITA specification defines over 600 elements. However, OASIS, the organization responsible for developing DITA, is not the only group to maintain a policy of inclusion. We’ve seen a number of proprietary schemas developed in this fashion: XML architects often err on the side of caution, creating elements that are rarely used or that are not needed yet but may one day be useful.
In addition to creating schemas with a lot of vocabulary, XML architects have also been known to create schemas with fairly complex grammar. Sometimes the structure of the XML document relies heavily on nesting of elements. Some schemas will have a wide variety of required and optional elements and attributes. Furthermore, schema design sometimes mixes use of elements and attributes such that inconsistencies cause confusion.
The problem is that, for now, we rely on people to add XML markup to documents. Traditionally, XML authoring systems have been powerful yet difficult to use. Either the system requires users to be experienced and well trained or the authoring tool needs to have been highly customized. SyncroSoft’s <oXygen/> XML editor, for example, can be used with any XML schema you choose to define and is fairly easy for anyone with development experience to customize, but the out-of-the-box experience for the end user is not at all intuitive for someone new to XML. It’s easy for the new user to insert an element in an invalid position, and though <oXygen/> prompts the user with warnings and explanations, it can still be a frustrating experience for someone who doesn’t have the context for interpreting those warnings.
For these reasons, many organizations have resisted implementing XML-based workflows for their business-critical content. As a result, we’re in the midst of a counter-revolution in document XML intended to simplify XML. We can see evidence of this counter-revolution in schema movements, such as that of Lightweight DITA. Such proposals not only trim down the number of elements to a bare minimum but also remove much of the nesting and other structural complexities. We also see the trend to simplify document XML in tools like Simply XML and Quark XML Author, two XML authoring systems that allow users to continue authoring in the familiar environment of Microsoft Word. Quark Author, a web-based XML authoring system, uses both a lean XML schema called Smart Content on the back end and an easy-to-use interface on the front end to take XML simplicity to the next level.
The challenge, of course, is not to oversimplify; otherwise the markup loses the very power that XML was intended to provide. For instance, CommonMark, a type of Markdown, was dreamed up as an alternative to XML/HTML that lets users type in a text editor and still get some special rendering, but if you need anything more complex than emphasis or lists, you have to type the required HTML markup into the editor yourself. CommonMark, then, has a very limited application.
Life will continue to get easier for content creators as our tools get smarter, but while humans are responsible for the richness of intelligent content, a balance needs to be struck between simple and powerful. Go too simple, such as with CommonMark, and your authors lose the ability to add much context to the words written. On the opposite end of the spectrum, a schema too descriptive becomes difficult for authors to use effectively. The sweet spot is XML nirvana.
About Autumn Cuellar
Autumn Cuellar has had a long and happy history with XML. As a researcher at the University of Auckland in New Zealand, Autumn co-authored a metadata specification, explored the use of ontologies for advancing biological research, and developed CellML, an XML language for describing biological models. Since leaving the academic world, Autumn has been delighted to share her enthusiasm for XML in technical and enterprise applications. Previously at Design Science, her roles included MathML evangelism and working with standards bodies to provide guidance for inclusion of MathML in such standards as DITA and PDF/UA. Now at Quark Software, Autumn provides her XML expertise to organizations seeking to mask the XML for a better non-technical user experience.
Note: This article was originally published in the June 2017 issue of CIDM eNews.
The Ugly Duckling No More: Using Page Layout Software to Format DITA Outputs
DITA is growing in popularity as a document standard and is now being used across a range of industries. As DITA grows beyond the scope of technical publications and as businesses become more concerned about branding documents across the organization, the current methods of coding templates to format DITA output are no longer sufficient for document production. We’ll explore using page layout software to design complex, visually rich templates for DITA and other XML document formats.
Many organizations around the world are automating their production of business-critical content with great success. Much of the creation process can be automated by pulling from external sources such as stock databases, geolocation systems, and statistical analysis reports. Translation memory databases are growing in popularity as a method for helping automate localization of content. Publication and delivery of documents can often be performed without human intervention. Using advanced template structures, document assemblies can be pre-approved and generated at the push of a button with just-in-time resolution of content.
DITA, an OASIS XML standard for documents and best practices, has helped pave the way for content automation. DITA supports foreign content, enabling the inclusion of data from outside sources, and its specialization architecture allows publishing channels to be built on or customized from existing publishing systems.
The initial application area of DITA was computer software documentation at IBM. Up until fairly recently, DITA remained, for the most part, in the realm of technical content. However, content producers of all kinds are now finding DITA to be a useful format for a wide range of applications. DITA is being used at universities, petroleum companies (Chevron, Schlumberger), non-profit organizations (FamilySearch, HealthWise), consortiums (World Agroforestry Centre) [Schengili-Roberts 2012], financial services organizations (Mastercard), and a number of non-technical publishing companies.
The main hurdle for the adoption of DITA in non-technical applications has been the technical nature of DITA and the associated Open Toolkit, used for converting DITA XML to and from other formats. One writer notes (emphasis mine), “DITA for non-technical writers is very much a real option, with some planning and tweaking of tools and workflows” [Samuels 2014]. However, the required planning and tweaking can be a significant obstacle for resource-strapped organizations.
Among the difficulties facing non-technical content producers using DITA, perhaps the most challenging is the design of output layouts. In a recent survey conducted by SyncroSoft, a large number of respondents cited PDF customization as their biggest frustration in working with DITA [Coravu 2016]. As Hans Christian Andersen highlights in his acclaimed 1843 fairy tale “The Ugly Duckling”, some hatchlings are perceived very differently. This paper describes how page layout software can be used by non-technical designers to add complex design and organization to DITA hatchlings.
A Brief History of Page Layout
Layout design has for centuries been a visual, manual process. Books produced in monasteries in medieval times generally featured a central block of text, surrounded by an artist’s ornamental design, or illumination. Even to this day, through the invention of the printing press and later computers and printers, page layouts are sometimes modeled on these early manuscript layouts [Novin 2010].
Figure 1. An elaborately illuminated manuscript, dated 1413-1416. A public domain image provided by The British Library.
As soon as the graphics capabilities of computers could support it, layout design moved to the territory of software. High quality page production was opened to the masses through WYSIWYG applications ranging from word processors to desktop publishing software. One of the earliest desktop publishing programs was PageMaker, originally produced by Aldus and later acquired by Adobe. PageMaker made it possible for designers to quickly compose text and images in eye-catching layouts and then send those layouts to printers.
Computers also enable the automation of publishing, but in order to fulfill this promise, page design concepts had to be translated to a programming language to support precise replication of a design. To this end, languages such as TeX and troff were created early on, even prior to WYSIWYG design software. As various digital document formats have emerged, so too has stylesheet support for these formats, allowing templates for design elements such as paragraph and line spacing, font families, and colors to be applied uniformly to documents. Two stylesheet languages in particular are used frequently for providing templates for DITA outputs: Cascading Style Sheets (CSS) and XSL Formatting Objects (XSL-FO).
The Current Landscape for PDF Output of DITA
The DITA Open Toolkit (DITA-OT) is maintained separately from the DITA specification – it is an open source toolset for converting DITA to a variety of other formats including PDF and HTML, the most popular output formats for DITA. As most of the popular DITA outputs are XML-based and not difficult to produce, the rest of this paper will focus on PDF output, which gives users of DITA the most headaches. Print continues to be an important delivery channel. Many organizations still rely on PDF for pre-press printing. Additionally, PDF is a convenient and simple distribution channel for branded layout of longer documents intended for anyone to print. For these reasons and others, PDF garnered the top spot as respondents’ most important output format for DITA in the previously mentioned SyncroSoft survey. [Coravu 2016]
For PDF output, the DITA-OT uses XSL-FO as an intermediate step. As the Open Toolkit is free and an active open source project, the DITA-OT is widely used for producing DITA outputs, especially as it is built in to many applications offering DITA support. Therefore, XSL-FO is the primary path by which PDF output is achieved. However, some tools use CSS for templating, and others use a proprietary approach. Finally, some DITA implementers have chosen to convert DITA to HTML or Word as the intermediate step before publishing PDF output.
Out of the box, the DITA-OT is set up to use the Apache Formatting Objects Processor (FOP) publishing engine but can be configured to use the Antenna House Formatter or the RenderX XEP engine for producing PDF. The advantages of using XSL-FO for PDF output are threefold: the DITA-OT is already set up to produce XSL-FO, an XSL-FO formatting engine (Apache FOP) is freely available, and XSL-FO is intended for paginated outputs and can reliably handle more complex layouts than CSS. These advantages cannot be overestimated. With almost no work required, resource-strapped organizations can get up and running with PDF output in no time, and basic customization can be performed by modifying the XSLT files that ship with the DITA-OT.
However, once an organization goes beyond requiring basic customization of the DITA-OT, the costs in time and money to work with XSL-FO increase dramatically. WYSIWYG XSL-FO software typically only offers basic functionality. Therefore, in most cases, customizing PDF output requires a skilled developer, preferably one who knows XSL-FO, which is not an especially common skill.
Furthermore, while DITA promises interoperability and the DITA-OT offers much faster ramp-up time than starting from scratch, differences in how the various rendering engines support the XSL-FO specification also require consideration. After investing in format development for FOP, for example, significant testing and refactoring are required when switching to another engine for PDF production. Small differences in rendering output matter to demanding enterprise customers who must meet specific business requirements for complex and engaging layouts. For these reasons, maintenance costs for XSL-FO can be high.
CSS is also designed for formatting and styling of content, and because CSS is widely known and easy to use, some DITA implementers have chosen to rely on CSS instead of XSL-FO. SyncroSoft, the makers of the <oXygen/> XML editor, have developed an open source DITA-OT plug-in that can convert DITA to PDF using CSS and either Prince XML or Antenna House Formatter, which can handle CSS as well as XSL-FO. Using a similar idea, some implementers first convert DITA to HTML/XHTML and then generate the PDF from the HTML using one of several applications available for this purpose, such as Prince XML.
The problem with CSS is that it was originally designed for web pages, for which pagination is not a priority. CSS2 does not have support for a number of features that XSL-FO supports, including multi-column layouts, items in margins such as footers and headers, page numbering, and cross-referencing particular page numbers. CSS3 introduced a Paged Media Module to help address some of these problems but not all [Harold & Means 2002]. Additionally, not all CSS formatting tools support the Paged Media Module. Depending on how complex the requirements are, CSS may not be sufficient.
Other Paths to PDF Output
Alternatives to XSL-FO and CSS do exist but are used less frequently. Some implementations will convert DITA to another intermediate format such as Microsoft Word before publishing the document to PDF. The drawback here is that there are now two transformation processes to manage and two processes during which artifacts may be lost.
There are a few commercial PDF renderers on the market that do not rely on XSL-FO or CSS for formatting, including TopLeaf XML Publisher and Adobe FrameMaker. Both TopLeaf XML Publisher and Adobe FrameMaker provide a WYSIWYG interface for designing the page layout of the output PDF, but for both, this is a secondary goal, and, therefore, design functionality is neither comprehensive nor particularly easy to master. TopLeaf is built around XML; to customize DITA templates in TopLeaf, the designer must have some knowledge of DITA. FrameMaker is targeted to technical content, and as one blogger notes in comparing FrameMaker to Adobe’s page layout application InDesign, “The key question here is: How important is great, typographically-sophisticated, cool-looking, creative design to communicating technical information?” [Gold 2013]. The answer is “not very,” which is why Adobe has the two different products and which is why for “cool-looking, creative design” functionality designers do not turn to FrameMaker.
Handing DITA Output Design Back to Designers
Because of the current toolset offering, most of the design of DITA outputs is currently performed in code, by modifying XSLT or CSS files to produce the correct look for a set of documents. This is not ideal for a visual process dating back hundreds of years, of course, but it has been a tolerable state of affairs because DITA, until recently, has been used primarily for technical content, and traditionally technical content has not required particularly creative design. However, two trends are changing the landscape. Firstly, DITA’s popularity is growing for non-technical content produced by non-technical contributors, resulting in an increased demand for WYSIWYG design tools. Secondly, branding and user experience are becoming important priorities for businesses [Goodson 2012]. Branding touches all aspects of a business, including its technical publications, and user experience includes design. Incorporating high quality design into branding efforts creates a competitive advantage that businesses are using with success.
Complex page layout design is already available in desktop publishing software. Most rich-layout publications such as magazines and catalogs are built in InDesign or QuarkXPress, two tools that have carried on PageMaker’s legacy. Since today’s content automation world is built on XML, it shouldn’t be surprising that both InDesign and QuarkXPress have support for XML. This is our path forward for handing DITA layout design back to designers for producing complex, beautiful layouts that can be published to PDF and other outputs. As this author is familiar with QuarkXPress, the process will be described using Quark software, but a similar process can be applied using Adobe’s InDesign and InDesign Server.
The Process with Page Layout Software
QuarkXPress allows designers to place any number of design elements in a page layout in containers called Boxes. All Boxes in a layout have an associated unique identifier; the designer has the option to attach an easily recognizable name to the identifier. Additionally, a QuarkXPress project can include a number of other named variables and design elements. Variables can be used for static content, such as a copyright statement, or dynamic content, such as the publication date. All style preferences can be set in the project and named, from color palettes to table and list styles.
As well as being able to set the design and style preferences in a QuarkXPress project and providing identifiers for them, almost every aspect of a QuarkXPress project can be represented as XML. QuarkXPress’ XML doctype is known as Modifier. Modifier can be used to create or delete Boxes, change the properties of Boxes, such as shape or position, change the content of Boxes, change the style of the content of Boxes, and so on.
Putting the Modifier and a QuarkXPress project together is where the magic happens. The QuarkXPress project provides the template that guides the Modifier. For example, the QuarkXPress project might include a Box with the name of “Title” and two character styles, one with the name of “Main Title” setting the font size to 48pt and the other named “Subtitle” setting the font size to 36pt. The Modifier will then specify the strings to write to the “Title” Box with instructions of when to use the “Main Title” character style and when to use the “Subtitle” character style.
Figure 2. A QuarkXPress project containing a Box named "Title" and character styles "Main Title" and "Subtitle."
Figure 3. Example Modifier XML
The <ID> element specifies the name of the Box to be modified, in this case “Title.” CHARSTYLE attributes indicate the character style to apply to the contained text strings.
Because the Modifier schema is so closely related to the QuarkXPress project and because QuarkXPress is a mature design package, the use of Modifier with QuarkXPress enables organizations to create complex sets of documents. A project might consist of different layouts for different targets or various page designs for parts of chapters or articles (for example, first and last pages may use different design elements than middle pages). And all is at the control of the designer because the QuarkXPress templates dictate the boundaries within which the Modifier operates.
The engine that puts the Modifier and QuarkXPress project together and converts the XML to a new output format is the QuarkXPress Server. QuarkXPress Server can be used to automate conversion of large volumes of documents to a variety of different output formats including PDF and HTML. All that’s left in the pipeline is mapping DITA to Modifier, and given that both are XML languages for describing documents, this is a straightforward XSLT conversion. This conversion process is made even easier if implemented as a DITA-OT plug-in to leverage the DITA-OT’s ability to process DITA maps, links, and references.
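To make the mapping step concrete, here is a minimal sketch written in Python rather than the XSLT the text describes, so treat it as an illustration of the idea and not as the actual pipeline. It follows the earlier “Title” Box example: the ID element and the CHARSTYLE attribute come from the description above, while the PROJECT, BOX, and RICHTEXT element names, the NAME attribute, and the use of the DITA shortdesc element as a subtitle are assumptions made for the example and should be checked against the Modifier documentation.

```python
import xml.etree.ElementTree as ET

# A tiny DITA-like topic. Treating <shortdesc> as the subtitle is an
# assumption made only for this illustration.
DITA_TOPIC = """
<topic id="duckling">
  <title>The Ugly Duckling No More</title>
  <shortdesc>Formatting DITA with page layout software</shortdesc>
</topic>
"""


def dita_to_modifier(dita_xml: str) -> ET.Element:
    """Map a topic's title and short description to a Modifier-like tree."""
    topic = ET.fromstring(dita_xml.strip())

    # PROJECT, BOX and RICHTEXT are illustrative names, not a verified schema.
    project = ET.Element("PROJECT")
    box = ET.SubElement(project, "BOX")

    # The ID element names the Box in the layout template to be modified.
    ET.SubElement(box, "ID", NAME="Title")

    # Each DITA element is written into the Box with its own character style.
    for tag, style in (("title", "Main Title"), ("shortdesc", "Subtitle")):
        text = topic.findtext(tag)
        if text:
            run = ET.SubElement(box, "RICHTEXT", CHARSTYLE=style)
            run.text = text.strip()
    return project


if __name__ == "__main__":
    print(ET.tostring(dita_to_modifier(DITA_TOPIC), encoding="unicode"))
```

The design point is that the mapping only names Boxes and styles; the font sizes, positions, and colors stay in the layout template, under the designer's control.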
Analysis of Approach
As we’ve seen, several different paths can be used for formatting PDF output of DITA content – each has its advantages and drawbacks. Let’s highlight some of the strengths and weaknesses of using page layout software.
Because there needs to be a link between the XML and the design project, designers will either need to stay within the confines of the design project or know enough about the implementation to design around it. Following on the above example, if a designer creates a new design template, s/he will need to know that the project must have a Box with the name of “Title” or the title of the document will not appear in the output, or if the designer is modifying an existing template, s/he will need to know that the Box with the name of “Title” can be modified but not deleted. This might restrict how a layout designer normally works.
On the other hand, page layout software does provide a powerful mechanism for designers to add pizazz to DITA outputs through a WYSIWYG interface. Layout design is largely a visual process that depends on seeing how elements of a layout relate to other design elements on the page, and since InDesign and QuarkXPress are both mature applications, they have extensive functionality for making this process easier for designers, from providing color pickers for matching colors to Bezier pen tools for creating interesting shapes. Additionally, in certain areas, functionality of page layout software goes where XSL-FO cannot, such as with running text along odd shapes and curves.
Finally, high quality desktop publishing systems are commercial applications – this can be either a strength or weakness depending on your point of view. Some organizations will not want to spend the money on upfront software costs and instead prefer to use their own development resources to build on open source applications like Apache FOP. Others prefer to invest in tested, supported commercial products.
Advantages and Challenges of Supporting DITA
The main advantage of supporting DITA (beyond its widespread adoption) is the existence of the DITA-OT. Thanks to a large and active open source community, the DITA-OT is already set up to process large and complex DITA documents. In preprocessing steps the DITA-OT handles such tasks as validating the XML, applying filters, resolving references, and moving metadata. The DITA-OT is then able to pass an intermediate, simplified DITA file to an external rendering process, such as the QuarkXPress Server.
The primary challenge of supporting DITA is its sheer breadth. The All-Inclusive DITA 1.3 Specification, which includes the Technical Content and Learning & Training specializations, lists over 600 elements, and this does not include the elements allowed through foreign XML languages SVG and MathML. Many of the elements are specialized off of existing DITA base elements, which means that out-of-the-box support of these elements comes with DITA’s typing architecture. However, to consider a rendering engine to have full support of DITA 1.3, the rendering engine should distinguish specialized elements from the base elements.
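The fallback behavior that DITA's typing architecture provides can be sketched as follows. Every DITA element carries a class attribute listing its ancestry from base to most specialized, for example "- topic/li task/step " for a task step. In real documents this attribute is usually supplied as a default by the DTD or schema; the sketch writes it out explicitly. The renderer table and function names are hypothetical, and a real rendering engine would do far more than return strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical renderers keyed by "module/element" class tokens.
RENDERERS = {
    "topic/li": lambda text: f"* {text}",        # generic list item
    "task/step": lambda text: f"Step: {text}",   # specialized rendering
}


def class_tokens(element):
    """Turn '- topic/li task/step ' into ['topic/li', 'task/step']."""
    return element.get("class", "").split()[1:]


def render(element):
    """Use the most specialized renderer available, else fall back to the base."""
    text = "".join(element.itertext()).strip()
    # The last token is the most specialized type, so walk backwards.
    for token in reversed(class_tokens(element)):
        renderer = RENDERERS.get(token)
        if renderer:
            return renderer(text)
    return text  # no renderer known for any ancestor type


if __name__ == "__main__":
    step = ET.fromstring('<step class="- topic/li task/step ">Open the file</step>')
    choice = ET.fromstring('<choice class="- topic/li task/choice ">Pick A</choice>')
    print(render(step))    # handled by the task/step renderer
    print(render(choice))  # no task/choice renderer, so topic/li is used
```

An engine that stops at the base type renders every specialized element acceptably, which is the out-of-the-box support mentioned above. Distinguishing specialized elements means adding entries for the more specific tokens.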
Regarding SVG and MathML, neither QuarkXPress nor InDesign has native support for either format. These XML formats can be converted to static images for use within these page layout applications, but then the inherent advantages of using these formats in the first place are lost, including accessibility and interactivity.
These challenges are certainly not insurmountable, and DITA support by page layout software will continue to improve. The more difficult branding requirements introduce challenges when DITA content is reused for other business units (e.g. marketing). Many design obstacles can be handled better by products which support high-fidelity page layout, but traditionally at the cost of automation. For example, the layouts most challenging to conventional XML-based publishing engines include features like irregular-sized graphics with text wraparound and multi-column layouts with callouts anchored to relevant text content. These difficulties can now be handled automatically with XML-aware layout engines, such as InDesign Server and QuarkXPress Server, used in concert with the DITA-OT.
As DITA continues to grow in popularity as a document format for non-technical industries and as branding and user experience become important priorities for organizations, demand grows for tools to make DITA easier to use and implement for non-technical authors and contributors. Chief among the pain points for DITA implementers is PDF customization: working with code is not always feasible for layout design, which for centuries has been a visual, manual process, nor does it allow for the rich design key to a great user experience. Using desktop publishing software such as QuarkXPress or InDesign, mature products for which the primary application is layout design, is one possibility for producing high-quality, rich-layout templates for use in PDF and other outputs.
Furthermore, because QuarkXPress and InDesign both support XML, a similar process can be used for other XML-based document formats. Smart Content [White 2015] has an extensible typed architecture like DITA but is simpler in nature (only a couple dozen elements in comparison to the 600+ elements in the DITA standard) and arguably more approachable to developers familiar with HTML-related standards. Smart Content is used extensively with QuarkXPress templates and has proven immensely successful for a broad range of enterprise organizations needing content automation coupled with engaging page layout. Using page layout software, all XML documents can become swans.
[Coravu 2016] Coravu, Radu. DITA Usage Survey. <oXygen/> XML Blog, 2016, April 5.
[Gold 2013] Gold, Peter. Which is better: FrameMaker or InDesign? InDesignSecrets, 2013, September 12.
[Goodson 2012] Goodson, Scott. Why Brand Building Is Important. Forbes, 2012, May 7.
[Harold & Means 2002] Harold, Elliotte Rusty, & Means, W. Scott. 13.5 Choosing Between CSS and XSL-FO. XML in a Nutshell, 2nd Edition. Sebastopol: O’Reilly & Associates, Inc., 2002.
[Novin 2010] Novin, Guity. Chapter 58: History of Layout Design and Modern Newspaper & Magazines. A History of Graphic Design. 2010.
[Samuels 2014] Samuels, Jacquie. Everybody into the Pool: Yes, Non-Technical Writers Can Use DITA. TechWhir-l, 2014, December 2.
[Schengili-Roberts 2012] Schengili-Roberts, Keith. Who is Using DITA? DITA Usage by Industry Sector. Information Management News, 2012, March.
[White 2015] White, David. “Smart Content for High-Value Communications.” Presented at Balisage: The Markup Conference 2015, Washington, DC, August 11 – 14, 2015. In Proceedings of Balisage: The Markup Conference 2015. Balisage Series on Markup Technologies, vol. 15 (2015). doi: 10.4242/BalisageVol15.White01.
Autumn Cuellar is a Technical Services Consultant for Quark Software. The paper “The Ugly Duckling No More: Using Page Layout Software to Format DITA Outputs” was first presented at Balisage: The Markup Conference in August 2016.
XML vs JSON
A funny thing happened on the way to XML’s world domination of the dissemination of written, document-oriented content: the data exchange world hijacked XML’s value and kept it for many years. Now JSON has the attention of web developers for data transactions – is XML in the way?
Getting Our Definitions Straight
For data (as used here, ‘data’ refers to relational or otherwise highly structured, discrete information such as financial data), XML and JSON are two sides of the same data description coin: either can be called and the game will be played. JSON works best for web-only developers, but learning XML isn’t too hard, and supporting resources are widely available, many of them free and open source.
For documents (as used here, ‘documents’ means a mix of authored prose, multimedia, and data meant for presentation to a content consumer), XML is still the dominant open-standard format, both for semantically rich content automation applications such as Quark Enterprise Solutions and for modern word processing tools such as the Microsoft Office suite, though the purpose, use, and value of XML differ significantly between these document-focused solutions.
History Lessons for Your Markup Language of the Day
XML became an official W3C recommendation in February of 1998. At my previous company, two team members worked on the XML standard for several years alongside a who’s who of document and hypertext technologists. The whole idea of XML, as driven by Jon Bosak, then at Sun Microsystems, was to take the benefits of SGML (Standard Generalized Markup Language) and apply them to this new thing called “The World Wide Web.”
I remember how excited we all were when the spec was finally approved. So much attention was now being paid to our corner of the high-tech universe and the idea of having semantic XML content on the web was, to us at least, so clearly valuable. But then, the data jocks overwhelmed us document kids like a high school basketball team coming on the court after the band warms up the crowd.
EDI (electronic data interchange) methods have been around since the early days of computing. By the time XML became a recommendation, the data world was already building a new EDI method that took advantage of the web’s HTTP for the transport of messages and data payloads with the data package built using XML syntax. This EDI method was called SOAP (simple object access protocol) and when released by Microsoft and others in 1999, it very quickly became the main hype of XML’s value. All of us document folks were left playing the sad trombone sound while we continued our efforts to make semantically rich content’s value accessible and available to all (and still do today!).
Of course, all was not perfect for XML as an EDI solution. XML is a fairly verbose markup language and therefore the XML data payload can be multiple times larger than the data set it’s describing. And XML requires a robust parser, which has its own rules that were originally targeting document requirements, not the needs of more compact data structures. And lastly, many browsers were slow to adopt XML as a web standard.
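The verbosity point is easy to check for yourself. The following sketch serializes one small, made-up record both ways and compares the sizes; the record, the element names, and the function names are all invented for illustration. The exact difference depends entirely on the data and on how the XML is designed (elements versus attributes, length of the names), so the numbers it prints are not a general benchmark.

```python
import json
import xml.etree.ElementTree as ET

# A small, invented record of the kind exchanged in data transactions.
RECORD = {"ticker": "ABC", "price": "12.50", "currency": "USD"}


def to_json(record):
    """Compact JSON: each field name appears once."""
    return json.dumps(record, separators=(",", ":"))


def to_xml(record, root_name="quote"):
    """Element-per-field XML: each field name appears in a start and an end tag."""
    root = ET.Element(root_name)
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    as_json = to_json(RECORD)
    as_xml = to_xml(RECORD)
    print(len(as_json), as_json)
    print(len(as_xml), as_xml)
```

Because every XML field name is written twice, the gap widens as field names get longer and values get shorter.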
It’s not XML vs. JSON, It’s Selecting the Right Tool for the Job
An oversimplified answer to the question “What is the right tool?” is something like this:
- XML for documents (written content)
- JSON for data (transactions over the web)
Of course there are still many systems that offer SOAP APIs. Further still, the more modern REST (representational state transfer) web API doesn’t really care about the payload format, so many systems may provide both XML and JSON responses (as does Quark Publishing Platform – developer’s choice). But there are definitely gray areas when trying to determine if XML or JSON is the best fit.
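As a rough sketch of how one REST endpoint can offer both formats, the function below picks a serialization from the request's Accept header. It is deliberately simplified: it does not parse quality values or wildcards, the record fields and the "asset" element name are invented, and it does not represent the API of Quark Publishing Platform or any other product.

```python
import json
import xml.etree.ElementTree as ET


def serialize(record, accept_header):
    """Return (content_type, body) for the first supported format requested."""
    if "application/json" in accept_header:
        return "application/json", json.dumps(record)
    if "application/xml" in accept_header or "text/xml" in accept_header:
        root = ET.Element("asset")  # hypothetical wrapper element
        for key, value in record.items():
            ET.SubElement(root, key).text = str(value)
        return "application/xml", ET.tostring(root, encoding="unicode")
    # A real service would answer with HTTP 406 Not Acceptable here.
    raise ValueError(f"unsupported Accept header: {accept_header}")


if __name__ == "__main__":
    record = {"id": "42", "title": "Q3 Report"}
    print(serialize(record, "application/json"))
    print(serialize(record, "application/xml"))
```

The resource is the same in both cases; only its representation changes, which is why the choice can be left to the developer consuming the API.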
Several standards exist that are used for transacting files and metadata between parties including:
- RIXML (Research Information Exchange Markup Language) used in Financial Services Investment Research publishing
- eCTD (electronic common technical document) used in pharmaceuticals for transmitting drug research to the FDA (US Food and Drug Administration)
- And other, more general metadata standards such as Dublin Core, XMP, and more
What these standards share is the use of XML to describe a package of documents in a way that lets the receiver of the package automate the handling of that package. For RIXML and eCTD, the payload mostly consists of PDF documents. The XML is used to hold the metadata that describes the package (producer, purpose, date, a description of each attached file, etc.). For the metadata “driver” or “backbone” file, XML made sense for many reasons, not the least of which was that the contributors developing these standards were XML-knowledgeable folks and that the tools and methodologies for creating these standards as XML were widely available.
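The idea of a metadata “backbone” file can be sketched in a few lines. The manifest below is a simplified, made-up structure: its element and attribute names are assumptions for illustration and do not follow the actual RIXML or eCTD schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical manifest. Element and attribute names are invented for
# illustration and do not follow the RIXML or eCTD schemas.
MANIFEST = """\
<package producer="Example Bank" date="2020-01-15" purpose="research">
  <file name="outlook.pdf" description="Quarterly outlook report"/>
  <file name="model.pdf" description="Valuation model summary"/>
</package>
"""

def read_manifest(xml_text):
    """Parse the manifest into a plain dict the receiver can act on."""
    root = ET.fromstring(xml_text)
    return {
        "producer": root.get("producer"),
        "date": root.get("date"),
        "purpose": root.get("purpose"),
        "files": [
            {"name": f.get("name"), "description": f.get("description")}
            for f in root.findall("file")
        ],
    }

package = read_manifest(MANIFEST)
for entry in package["files"]:
    print(package["producer"], entry["name"], entry["description"])
```

With the metadata in a predictable structure, the receiver can route, store, or index each attached PDF without opening it.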
Of course 27 characters isn’t particularly meaningful, but multiply the size of those messages by 10, 100, or 1000 and the size difference becomes meaningful. Yegor concludes that JSON is great for data sent to dynamic web pages, but he recommends XML for all other purposes.
However, his arguments against JSON were already being addressed (as he admits toward the end of the article) as the JSON world brought more tools to the party such as JSONPath and JSON rules files with validating JSON parsers. JSON features are now reasonably on par with XML, though of course still focused on solving the challenges of transacting data.
A Little More about RIXML – A Good Test Case
If you are technically minded and curious, it might be worth reviewing the RIXML Data Dictionary and jumping to page 21, where the data dictionary begins in earnest. It takes a little over 100 pages to document the entire data semantics structure of the main areas of concern (not including the “sidecars,” as RIXML calls them). The result is a metadata file describing the payload for a transaction of what is typically one or more PDF documents.
There is no reason why that structure couldn’t be represented as JSON, but there’s also not a particularly good reason to do so either. Ultimately what matters is which system receives the RIXML and document payload. In the case of most RIXML processing systems it is likely a backend server using Java or .NET code to parse the RIXML file and then update a database and file system according to agreed-upon business rules.
For example, take a distributor of financial research information that is produced and sent to the distributor by multiple different banks (the reason RIXML was created to begin with!). They receive the RIXML package, process it, store the information in their database or content management system and then present some portion of that information on a web page for subscribers to access. They don’t present the entire RIXML metadata – most of that would be useless to the research consumer. And a RIXML package isn’t really dynamic either – for a particular package, the metadata doesn’t change very frequently, if ever.
The distributor’s system isn’t going to rely on creating a subset of the original RIXML file to send to the browser. No, they’re going to query the system, because it is their single source of truth for content that is available. Delivering query results from the system as JSON to the browser is easier than creating or re-parsing or modifying the original RIXML file.
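A minimal sketch of that last step, assuming the distributor has already loaded package metadata into a database: query the store and hand the browser only the fields a subscriber needs, as JSON. The table layout, column names, and sample rows are hypothetical, and an in-memory SQLite database stands in for whatever system the distributor actually runs.

```python
import json
import sqlite3

# In-memory database standing in for the distributor's content system.
# The schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports ("
    "id INTEGER PRIMARY KEY, title TEXT, producer TEXT, "
    "published TEXT, internal_notes TEXT)"
)
conn.executemany(
    "INSERT INTO reports (title, producer, published, internal_notes) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Quarterly Outlook", "Example Bank", "2020-01-15", "routing: desk 4"),
        ("Sector Review", "Example Bank", "2020-02-03", "routing: desk 2"),
        ("Rates Update", "Other Bank", "2020-01-20", "routing: desk 1"),
    ],
)

def reports_as_json(connection, producer):
    """Return the subscriber-facing fields for one producer as a JSON string."""
    rows = connection.execute(
        "SELECT title, published FROM reports "
        "WHERE producer = ? ORDER BY published",
        (producer,),
    ).fetchall()
    return json.dumps([{"title": t, "published": p} for t, p in rows])

print(reports_as_json(conn, "Example Bank"))
```

Note that the internal metadata never reaches the browser, and the original XML file is not touched again after ingestion.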
So an argument could be made to support both XML and JSON in RIXML (and by extension other metadata standards). Unfortunately for the JSON-only audience, the expense to recast those 100 pages of specifications for XML as JSON is non-zero, and a one-time conversion of the XML to JSON is not developer friendly. And for all of this additional effort the benefit would only apply to those that have yet to learn XML.
Long Story Bygones
There are and always will be waves of new technology that provide an alternative to, overlap with, completely replace, or partially supplant an existing technology. At the time of XML’s development, its use for transacting data was secondary to its original purpose, which shows just how hungry the data world was for a better-defined standard for transactions. That a more purpose-built, data-friendly format, JSON, was created at close to the same time also highlights how much need there was for improvement and standardization in data transactions.
However, XML is still a fantastic technology for handling documents, metadata, and data, and it is especially adept at merging all three into a common structure that can be utilized by software for automation and by humans for authoring and consuming. If you are not exclusively processing discrete data transactions, there is a lot of benefit to understanding and utilizing XML and the rich toolsets that are available.
If you are purely a web-data jockey, it would still benefit you to learn XML and the associated tools because: a) you’re likely to run into a system that provides only XML; and b) having some XML skills would extend your opportunities to do cool things in the document content domain.
Metadata: Really Cool…When You Don’t Have to Mark It
Earlier this month, the announcement that Siri’s creators had successfully placed an order for pizza with voice commands made a splash in the technology community because ordering a pizza is a fairly complex process for dumb machines. The success of the pizza order is one realization of the Semantic Web envisioned by Tim Berners-Lee, a vision that sees computers intelligently communicating with each other to automate complicated tasks. If the voice-commanded pizza order is any indication, the future is bright for increased productivity through digital assistants.
Much of the promise of the Semantic Web is built on a platform of metadata, which is used to identify data in a machine-readable fashion. For example, metadata can be used to differentiate between types of doctors so that someone asking their digital assistant for a nearby doctor isn’t directed to a doctor of veterinary medicine.
Metadata can be useful wherever a bit of machine intelligence is needed, including in an organization’s business-critical content. Take, for instance, online policy documents that are peppered with important terms defined in a glossary: it might be useful for key terms to be automatically linked to the definition of those terms, or a window with the definition to appear when your audience hovers over the term. For another example, financial reports may need public companies automatically linked to stock data.
In the above examples, glossary terms and public companies need to be marked as such for the system to identify where linking or hover behavior needs to be added. The problem is that leaving the marking responsibility to your subject matter experts (SMEs) is an undue burden. First, manual marking of metadata is an inefficiency that wastes your SMEs’ time. Second, the potential exists for metadata-appropriate content to be overlooked, which means that your audience may not receive the full benefit of your metadata-enriched processes.
Automated marking of metadata is one way in which Quark’s content automation expertise can help organizations enrich content processes — documents can be scanned for glossary terms, public companies, or other such important words or phrases and have the necessary metadata inserted at key points automatically using Quark software. Organizations can place the power of the automated metadata marking into their SMEs’ hands so that they can see where the metadata is being added and supplement or remove the marking as needed. Alternatively, the metadata can be added to content during the publishing process without any human intervention. Either way, your published content will be all the more rich for the intelligence added through metadata.
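The general technique of automated term marking can be sketched briefly. This is an illustrative example only, not a description of how Quark software works: the `<term>` and `<p>` element names, the glossary, and the `mark_terms` function are all assumptions.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical glossary, invented for illustration.
GLOSSARY = ["deductible", "premium"]

def mark_terms(text, glossary):
    """Wrap each glossary term found in plain text in a <term> element."""
    if not glossary:
        return "<p>" + escape(text) + "</p>"
    # Longest terms first, so a longer phrase wins over a shorter one it contains.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    # With one capturing group, re.split alternates between unmatched text
    # (even indexes) and matched terms (odd indexes).
    parts = pattern.split(text)
    marked = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            marked.append("<term>" + escape(part) + "</term>")
        else:
            marked.append(escape(part))
    return "<p>" + "".join(marked) + "</p>"

marked = mark_terms("Your premium rises if the deductible is < $500.", GLOSSARY)
print(marked)
```

A subject matter expert could then review the marked-up result and add or remove `<term>` markers where the simple word match gets it wrong, for example where a glossary word is used in a non-glossary sense.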
The Beginner’s Guide to Smart Content
For decades technical writers and technical publishers have reaped the benefits of XML to lower the cost and effort associated with creating, managing and reusing content across multiple output formats. Now, with the introduction of Smart Content, business users and subject matter experts can easily adopt XML in order to keep up with consumer demand for high-value communication.
Download the free eBook “The Beginner’s Guide to Smart Content” for a look at the evolution of XML and Smart Content, with chapters that include:
- What is Smart Content?
- A Brief Primer on Publishing Processes
- The Cost of Smart Content
- Who Else Is Using XML for Document Production?
- What’s Wrong with XML?
- Smart Content Details
- 12 Reasons to Adopt Smart Content