At its most basic level, metadata is about 5W1H—the who, what, where, when, why, and how of a data set, object, or resource (such as an article, a book, or a journal). Gathering these basic bits of information and plugging them into the system should be easy, right? You wish.
As Marianne Calilhanna, marketing director at Cenveo Publisher Services, puts it: “Metadata makes the content world go around, but the metadata struggle is still very real, especially in these four areas: the lack of consistent information, controlled vocabularies, revision control, and author-supplied vs. publisher-supplied vs. vendor-supplied metadata.”
Complexities Abound
Calilhanna points to publication date as a simple example. “Think about all the types of metadata that one needs for a journal article: preprint date, online date, print publication date, and issue title date. Then combine that with the various ways dates can be represented—day, month, quarter, year, for instance—depending on the publisher. Throw in dates of any revisions and things become complicated. So, when combining a number of journal articles from various publishers on a platform, how do you bring consistency for search and discovery?”
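To make the problem concrete, a platform that ingests articles from many publishers has to collapse dates of differing granularity into one sortable, searchable form. The following is a minimal sketch, not any vendor’s actual pipeline, and it assumes a made-up record layout with year, month, day, and quarter fields:

# Illustrative only: normalize publication dates of mixed granularity
# into ISO 8601-style keys that an index can sort and compare.
from datetime import date

def normalize_pub_date(raw: dict) -> str:
    """Accepts hypothetical fields: year (required), plus month, day, or quarter."""
    year = int(raw["year"])
    if raw.get("month") and raw.get("day"):
        return date(year, int(raw["month"]), int(raw["day"])).isoformat()  # e.g., 2023-06-09
    if raw.get("month"):
        return f"{year:04d}-{int(raw['month']):02d}"                       # e.g., 2023-06
    if raw.get("quarter"):
        first_month = {"Q1": 1, "Q2": 4, "Q3": 7, "Q4": 10}[raw["quarter"]]
        return f"{year:04d}-{first_month:02d}"                             # quarter mapped to its first month
    return f"{year:04d}"                                                   # year only

print(normalize_pub_date({"year": 2023, "quarter": "Q2"}))       # 2023-04
print(normalize_pub_date({"year": 2023, "month": 6, "day": 9}))  # 2023-06-09

Even a toy like this has to make an editorial decision, mapping a quarter to its first month, which is exactly the kind of policy a platform must apply consistently across publishers.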
It only gets more complex as publishers merge or acquire content from other publishers. What happens when a journal moves from one publisher to the next, or from a society self-publishing model to a commercial publisher? Will the article DOIs stay the same, with only their URLs redirected to the new website, much as when a publisher changes its hosting vendor? For Calilhanna, “All these require deep thinking and planning, and a service provider who understands not simply what needs to be changed, but also why, and offers a reasonable workflow to manage the changes.”
Let’s go back to the sources of metadata in the publishing industry. There are primarily two sources: the publisher’s system (i.e., the core metadata) and the submission system (enhanced metadata). “The enhanced metadata, which is mostly fed by authors during the submission process, is prone to typos and outdated or incorrect information,” explains Rahul Arora, CEO of MPS. “Author’s contact address, for instance, may have changed from when the original submission was made to when the article was finally accepted through the peer-review process. But it may not be updated in the submission system.”
So metadata verification and validation are crucial. “This is a core production activity at MPS: Our production and workflow tools are tightly integrated, and communicate effectively, with the metadata received from publishing clients. Discrepancies in information between the metadata and manuscripts are identified and queried. For some metadata fields—funding information and ORCID details, for instance—MPS tools validate the manuscript details against relevant online database APIs to correct and/or query the author or publisher as recommended. This is to ensure that such critical metadata elements are rendered accurately in the systems,” adds Arora.
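As a rough illustration of what such validation can involve, and not a description of MPS’s actual tooling, an ORCID iD can be checked locally against its ISO 7064 MOD 11-2 check digit before any lookup, and the public record can then be fetched for comparison against the manuscript (this sketch assumes ORCID’s public v3.0 record endpoint):

# Illustrative checks on an author-supplied ORCID iD.
import re
import urllib.request

def orcid_checksum_ok(orcid: str) -> bool:
    """Validate the ISO 7064 MOD 11-2 check digit used by ORCID iDs."""
    digits = orcid.replace("-", "")
    if not re.fullmatch(r"\d{15}[\dX]", digits):
        return False
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return digits[-1] == ("X" if check == 10 else str(check))

def fetch_orcid_record(orcid: str) -> bytes:
    """Fetch the public record for comparison (assumes ORCID's public v3.0 API)."""
    req = urllib.request.Request(
        f"https://pub.orcid.org/v3.0/{orcid}/record",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

print(orcid_checksum_ok("0000-0002-1825-0097"))  # True; ORCID's published sample iD

A mismatch between the fetched record and the manuscript would then be queried back to the author or publisher, as Arora describes.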
In the newspaper and magazine segment, metadata is captured at issue, page, and article levels, and the XML standard adopted for archival digitization by libraries and content aggregators worldwide is METS/ALTO. “Metadata can be harvested in a variety of standard formats, such as Dublin Core, NISO, PREMIS, RDA, XMP, MPEG-7, IPTC, EXIF, and many more,” explains Amit Vohra, founder and CEO of Continuum Content Solutions. “But the creation and management of metadata is a challenge because while libraries and indexers usually create the metadata, sometimes nonspecialist users are also generating them, and this makes standardization difficult. A proper guideline should be followed on how, who, and when to write or remove metadata.”
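Dublin Core, the first format Vohra names, is also the simplest to picture: fifteen general-purpose elements such as title, creator, date, and identifier. The snippet below builds one such record purely for illustration; every value in it is invented:

# Emit a simple Dublin Core record using the standard element namespace.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
for element, value in [
    ("title", "A Hypothetical Article"),        # placeholder values throughout
    ("creator", "Doe, Jane"),
    ("publisher", "Example University Press"),
    ("date", "2024-03"),
    ("type", "Text"),
    ("identifier", "doi:10.0000/example.1234"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))

The same handful of fields, filled in inconsistently by specialists and nonspecialists alike, is where the standardization problem Vohra describes begins.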
Identifying core metadata elements is another issue. “Core elements should describe the origin, composition, navigation, confidentiality, and quality of the digital object or resource,” says Vohra. “For this purpose, metadata can be further classified as administrative metadata, structural metadata, technical metadata, and descriptive metadata. It can get complex in a hurry.” And there is no single metadata standard sufficient to describe all the documents emerging in various formats. “But sampling and analysis of metadata usage patterns can serve as a guide for designing core metadata sets.”
Meanwhile, there is a lot of metadata out there produced by Google on its own, observes Abhigyan Arun, CEO of TNQ, “and the Google search engine is ignoring publishers’ metadata while only recognizing those generated by its own algorithm. How will publishers react to—and deal with—this?” Another interesting development, Arun says, “is the effort to create a centralized metadata repository along with a standard representation of metadata, which is very much needed to improve interoperability between content from various publishers. Crossref is taking the lead on this, and more will follow suit, I believe.”
Publishers have spent a lot of time and effort in creating metadata, although some do realize that they are not using all of it effectively, adds Arun. “There is a lot of activity in aligning the creation of metadata with its use on online platforms. What’s more, metadata creation, which was traditionally a prepress job, is now being pushed upstream. Our platform, Proof Central, for instance, is generating metadata much earlier in the production process, with the metadata vetted by the authors themselves.”
Naming a file correctly, creating a proper biography of the author, and tagging the right keywords are key to success with metadata. “Sadly, not many digital publishers are doing this due to their lack of understanding of the importance of metadata, and so their documents get lost in the crowded content space. Providing the right information in metadata can help locate the content easily and quickly,” says Subrat Mohanty, CEO of Hurix Digital, who finds that naming the author is most essential for academic content. “It is equally important to create their profiles and include information on social media accounts such as LinkedIn and Twitter so that users can connect with the authors later on. Such information can also help published work get listed properly on various platforms and search engines, and aid discoverability.”
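One widely used way to surface exactly this kind of author and keyword metadata to search engines is schema.org markup embedded as JSON-LD; the sketch below is generic rather than Hurix-specific, and the names and profile URLs are placeholders:

# Generic schema.org JSON-LD for a book, its keywords, and its author's profiles.
import json

book_metadata = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "Example Title",                           # placeholder title
    "keywords": ["metadata", "discoverability", "digital publishing"],
    "author": {
        "@type": "Person",
        "name": "Jane Doe",                            # placeholder author
        "sameAs": [
            "https://www.linkedin.com/in/janedoe",     # placeholder profile links
            "https://twitter.com/janedoe",
        ],
    },
}

print(json.dumps(book_metadata, indent=2))

The sameAs links are what tie the published work back to the author’s LinkedIn and Twitter presence that Mohanty mentions.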
Monetization Through Discoverability
Metadata improves the discoverability of all books, leading to better sales—and this is the most important aspect for publishers, says Maran Elancheran, president of Newgen KnowledgeWorks, who points out: “Accurate and concise metadata helps the reader to find the right book, and ensures that it meets his or her needs before purchasing it. And by adding the right accessible metadata to e-books, the digital content thus becomes accessible, usable, and discoverable.”
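The accessible metadata Elancheran refers to can be carried in the e-book itself: an EPUB 3 package document can declare schema.org accessibility properties that describe how the content can be consumed. The snippet below simply holds an illustrative fragment of that markup, with placeholder values:

# An illustrative EPUB package-metadata fragment with schema.org accessibility properties.
accessibility_metadata = """
<metadata xmlns="http://www.idpf.org/2007/opf"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example Title</dc:title>
  <meta property="schema:accessMode">textual</meta>
  <meta property="schema:accessMode">visual</meta>
  <meta property="schema:accessModeSufficient">textual</meta>
  <meta property="schema:accessibilityFeature">alternativeText</meta>
  <meta property="schema:accessibilitySummary">
    Images include alternative text; content is navigable by headings.
  </meta>
</metadata>
"""
print(accessibility_metadata)

Tags like these are what let retailers and reading systems signal to a buyer, before purchase, whether an e-book will work with their assistive technology.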
Even though metadata has been around for quite a while, many publishers coded it only because it was a requirement, and not all of them did so meticulously. “It has been all the buzz in the last couple of years as publishers increasingly realize the relation between metadata and the discoverability of their online content. But metadata does not just apply to books; it applies to articles, and now, to other digital assets as well,” says A.R.M. Gopinath, executive v-p at DiacriTech.
“When content aggregation, segmentation, customization, and reuse become the norm rather than the exception, metadata is very useful in locating the appropriate content for both internal and external consumption at a publishing house,” adds Gopinath. “Although it is more an art than a science, our team has been building AI tools to help our subject matter experts tag the content with the appropriate metadata to ensure higher consistency.”
The lag between publication and discovery of research has been well documented in various studies. “We have seen academics struggle to find the right information amid the deluge of content available out there,” says Ashok Giri, CEO of PageMajik, which enables publishers to tag metadata at the chapter level, allowing deeper search functionality and thus easier, faster discoverability. “As the audience for research and scholarly publishing expands and shifts toward digital and direct outreach, publishers need to find ways to work quickly while easily adapting their current systems to stay not only competitive but also viable. PageMajik offers such a simple and cost-effective method,” adds Giri.
As self-publishing becomes more vibrant, educating indie authors on the importance of metadata and its use in marketing their titles is a must, says V. Bharathram, president of Lapiz Digital Services. “Achieving consistency in metadata is an issue. We have seen that marketing of books becomes much easier with clean and consistent metadata, and when the metadata is optimized for search engines.”
Publishers are now using metadata significantly to monetize content, says Vidur Bhogilal, vice chairman of Lumina Datamatics, whose team is doing a number of projects in which the clients are using metadata to string together disparate strands of content to create new assets. “They are also reusing pieces of content from multiple products to create new offerings, and they are getting down to more granular chunks of content to sell. For instance, our team is helping Wiley to pull visual assets from their journals and make them available for sale separate from the articles. This ingestion of visual content into a selling system is made possible and easier due to proper metadata tagging.”
For Uday Majithia, assistant v-p in technology services and presales at Impelsys, metadata is critical not just for digital content discovery but also for personalized learning. “When we talk about smart content or adaptive learning solutions, the underlying layer of metadata is what makes it all possible.”
Majithia and his team have worked with publishers to enrich digital products—books, articles, quizzes, ancillary content, and course modules—with metadata tagging. “It allows users to easily find what they are looking for through efficient search algorithms and also allows them to see related articles that would be of interest to them. And when it comes to assessment applications, successful and efficient metadata tagging allows instructors to narrow down the list of questions each student receives based on his or her skill level and areas of interest.”
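A toy example makes the mechanism clear. If each item in a question bank carries topic and difficulty tags, selection for a given learner reduces to filtering on that metadata; the code below is a deliberately simplified illustration, not Impelsys’s platform:

# Filter a (fictional) question bank by tagged topic and difficulty.
questions = [
    {"id": "Q1", "topic": "algebra",  "difficulty": 1},
    {"id": "Q2", "topic": "algebra",  "difficulty": 3},
    {"id": "Q3", "topic": "geometry", "difficulty": 2},
]

def select_questions(bank, topics, max_difficulty):
    """Return items whose tagged topic and difficulty fit the learner's profile."""
    return [q for q in bank if q["topic"] in topics and q["difficulty"] <= max_difficulty]

print(select_questions(questions, topics={"algebra"}, max_difficulty=2))  # keeps only Q1

Without that underlying layer of tags, there is nothing for the adaptive logic to filter on.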
Metadata is so intrinsic to editorial and typesetting that it is not even a topic anymore, says Tyler Carey, chief revenue officer of Westchester Publishing Services. “What makes metadata a topic is when it is being used for something that is not pedestrian, or generated in a way that adds value beyond the incoming metadata attached to a manuscript.”
Increasingly, Carey is seeing more publishers interested in “K&A”—key terms and abstracts—as metadata to enhance content discoverability. “Our director of technology, Michael Jensen—who is one of the founders of Project MUSE and a former manager at the National Academies Press—works with our developers in Chennai, India, to develop a semantic analysis system that combines the best of scripting and people skills to glean metadata keywords from a manuscript, and then use real, living human beings—not AI—to review and catalogue that data in a way that makes it more usable for discovery and the creation of abstracts.”
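The division of labor Carey describes, scripts to propose and people to approve, can be pictured with something far cruder than Westchester’s system: even a plain term-frequency pass over a manuscript yields keyword candidates for a human cataloguer to review. The sketch below is that crude version, offered only to show the shape of the workflow:

# Deliberately simple keyword-candidate extraction for human review.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "that", "with"}

def keyword_candidates(text: str, top_n: int = 10) -> list:
    """Count non-stopword terms; a cataloguer reviews and prunes the output."""
    terms = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(t for t in terms if t not in STOPWORDS)
    return counts.most_common(top_n)

sample = "Metadata improves discoverability. Consistent metadata helps search engines surface content."
print(keyword_candidates(sample, top_n=5))  # [('metadata', 2), ...]

The human pass is what turns raw term counts like these into the curated key terms and abstracts that publishers actually want attached to a title.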
Carey and his team are currently processing 200 backlist titles for a university press through the semantic analysis system. “Many of the titles predate metadata in the way we know it now, and through this system we will create discoverability-focused metadata tags and other supporting metadata and content to add value for these titles when they are placed online.”
Metadata enrichment and content agility are areas requiring immediate action from publishers, says Sriram Subramanya, the founder and CEO of Integra Software Services. “With digital content being generated from multiple sources, discoverability becomes a pain point for users, and consequently, for content creators or publishers. The task of infusing legacy content with metadata is laborious, but it needs to be done in today’s era of digital content convergence.”
Content seekers, adds Subramanya, can miss out on good content if the right keywords are not used to enrich the content. “Content without a proper distribution chain does not reach the end user, and to avoid this, infusing content with proper metadata becomes critical.”