Markup Standards for Books and Journals: Digital Solutions in India 2015

In any standards landscape, diversity is the norm, and a single or permanent solution is unlikely. Because content often has to connect to bibliographic and business metadata, integration with other standards can be both a requirement and a challenge.

Over time, standards will evolve and new standards will appear, adding to the diversity of the landscape. During the past two decades, there has been a move away from proprietary publisher-specific SGML and XML DTDs and toward the adoption of industry standards. But as those industry standards evolve, adoption of new versions can take time; for example, in journals publishing, the NLM DTD was used for a long time after its successor, the JATS DTD, was released because journal hosting platform vendors had not made the switch to the new standard.

Content markup standards tend to be very rich vocabularies that support a wide range of functions beyond just capturing the text. Thirty years ago, rendition of the text was the core function of content markup, which was the digital equivalent of typesetting and support of the print world. The next generation of content markup moved the focus to structure (sections, components, etc.) and rendition based on that structure. That change facilitated the management of metadata within the text.

With the emergence of the Internet, content markup standards expanded to include internal and external relationships—features well beyond the world of print pages. Recently, content markup has become even more complex with support of semantic enrichment: markup of entities and concepts and pointing to external intellectual resources such as linked data.

Throughout the history of content markup standards, there have been key design challenges that have had a variety of solutions. These challenges have come, gone, and reappeared over decades of markup solutions.

One of the earliest design issues was generated text and automated styling: does the file contain every bit of text visible to the reader or is some of it generated by styling and rendition rules? And if so, how is that generated text managed over time and archived?
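The generated-text question can be illustrated with a minimal Python sketch. It assumes a hypothetical source file whose section titles carry no numbering; the element names (`article`, `sec`, `title`) and the numbering rule are illustrative choices, not taken from any particular standard or workflow.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured source: the titles are stored without numbers.
SOURCE = """
<article>
  <sec><title>Introduction</title></sec>
  <sec><title>Methods</title></sec>
  <sec><title>Results</title></sec>
</article>
"""

def stored_titles(xml_text):
    """Return the title text exactly as it is stored in the file."""
    root = ET.fromstring(xml_text)
    return [sec.findtext("title") for sec in root.findall("sec")]

def rendered_titles(xml_text):
    """Return the titles as a reader would see them after rendition."""
    root = ET.fromstring(xml_text)
    rendered = []
    for number, sec in enumerate(root.findall("sec"), start=1):
        # The "1. " prefix is generated text: it exists only in the
        # rendition, never in the archived file.
        rendered.append(f"{number}. {sec.findtext('title')}")
    return rendered

print(stored_titles(SOURCE))
print(rendered_titles(SOURCE))
```

In this sketch the archived file and the reader-visible text differ, so preserving what the reader saw would mean archiving the numbering rule along with the content.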

Another challenge has been managing metadata: is it present in the reader-visible text or is it additional information to be embedded invisibly in the file as elements or attributes? A classic issue in structure markup from the early days of SGML has been whether the markup should capture presentation or semantic meaning.
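A small Python sketch may make the presentational-versus-semantic choice concrete. Both fragments below are invented for illustration: the `italic` and `taxon` element names and the `rank` attribute are assumptions, not quoted from any specific DTD.

```python
import xml.etree.ElementTree as ET

# Presentational: the markup records only how the words look.
PRESENTATIONAL = (
    "<p>The bacterium <italic>Escherichia coli</italic> was cultured.</p>"
)

# Semantic: the markup records what the words are; the rank is metadata
# carried invisibly in an attribute (hypothetical vocabulary).
SEMANTIC = (
    '<p>The bacterium <taxon rank="species">Escherichia coli</taxon> '
    "was cultured.</p>"
)

def visible_text(xml_text):
    """Return the text a reader sees, with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())

def species_names(xml_text):
    """Return species names, which only semantic markup can identify."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("taxon") if el.get("rank") == "species"]

print(visible_text(PRESENTATIONAL) == visible_text(SEMANTIC))
print(species_names(PRESENTATIONAL))
print(species_names(SEMANTIC))
```

Both fragments give the reader the same text, but only the semantic one lets software retrieve the species name without guessing from typography.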

A newer trend in content markup has been in response to internationalization: support for textual alternatives such as translations and transliterations, which is robustly implemented in both JATS and ePub3. As content markup moved beyond just creating printed pages to managing collections of text and media, packaging has become a key issue. The ePub standards include the most robust and elegant solution to date.
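As a rough illustration of the packaging approach, the Python sketch below assembles the outer shell of an ePub container: a ZIP archive whose first entry is an uncompressed `mimetype` file, plus a `META-INF/container.xml` that points to the package document. The `OEBPS/package.opf` path and the helper name `build_container` are assumptions made for this example, and the result is only a skeleton, not a complete or valid publication.

```python
import io
import zipfile

# Points reading systems at the package document (path is an assumption).
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/package.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

def build_container(files):
    """Return the bytes of a skeleton ePub container holding `files`.

    `files` maps archive paths to text content (hypothetical input shape).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is stored without compression.
        archive.writestr(
            "mimetype", "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        archive.writestr(
            "META-INF/container.xml", CONTAINER_XML,
            compress_type=zipfile.ZIP_DEFLATED,
        )
        for name, data in files.items():
            archive.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()

# Placeholder package document, for illustration only.
epub_bytes = build_container({"OEBPS/package.opf": "<package/>"})
with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
    print(archive.namelist())
```

A real publication would also need a full package document with metadata, a manifest, and a spine, along with the content documents themselves.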

In the history of content markup languages, the most significant change has been the move from rendition to structure. TeX was a rich typesetting language; LaTeX was a set of macros built on TeX that associated formatting with structure. SGML and XML were focused on structure but rendition elements were present in varying degrees. HTML was both presentation and structure along with dynamic reflowable rendition. EPub supported reflowable rendition but now ePub3 also supports fixed layout, so the pendulum can swing both ways as standards evolve.

HTML5 is a big change from early HTML: it is now more “structurally semantic” and presentation has been moved to the CSS vocabulary. So it is much more like a document DTD in the XML tradition with rich structural markup. An amusing example of this profound change is the group of elements that in HTML4 were called “font style elements” and now in HTML5 are called “text-level semantics”. The definition of the “<i>” element changed from “renders as italic text style” to “represents a span of text in an alternate voice or mood.” This is a recent instance of a key markup design choice that has been on the table since the earliest days of SGML.

Content markup standards are inherently complex, so there can be wide variety in how they are used. Standardization of usage can be at various levels: within a single work, a collection, a publisher, or a genre. Usage can also change over time. These practices are defined as “usage profiles” and include how one standard invokes another. For example, ePub uses a specific profile of HTML5. Meanwhile, JATS4R (JATS for Reuse) is a newly developing profile defining a subset of the JATS standard.

When designing content management solutions using content markup standards, there are many key issues. The highest-level choice is whether to use a delivery format or a master format from which multiple renditions can be generated; the new rich design of HTML5 means that it could potentially be both. Reusability of content, present and future, is another key solution requirement. As standards will inevitably evolve over time, version control of both content and specification/standardization is critical to protecting the future use of the content.

Metadata management is a long-standing design challenge: is the metadata embedded in the content, extracted from it, or managed externally? A robust workflow solution has to manage not only the structured markup but also the content itself; e.g., editorial preferences such as section headings. Controlling content styling is not a function of markup languages but can be managed with pre-edit and validation tools that connect the markup structure with editorial guidelines.

The NISO JATS standard is the most-used standard in the journal publishing space. It is a massive content markup vocabulary with rich markup for front matter and citations. Because it is complex, usage is wildly variable across the community. EPub3, described as a “distribution and interchange format”, is a great set of components that goes beyond just content markup: package, metadata, rendition, and semantic support such as RDFa. Because HTML5 has shifted toward structural markup, the difference between JATS and ePub has diminished.

In the past, the differences between the books and journals communities were significant. Books had strong metadata needs driven early on by the book-selling process; journals later moved to significant metadata exchange. As e-book readers were industry tools, book publishers had to support industry standard applications. Online journal publishers focused on HTML and PDF on hosting platforms. In the last decade, the differences have diminished as the rise of mobile reading devices as reader platforms has impacted both books and journals. Some journal publishers started using ePub to deliver articles or entire issues.

As the standards landscape continues to evolve, long-term success will depend on leveraging community expertise and implementing standards using robust content management practices.