E-books can be a proofreader’s nightmare. Often, those nicely laid-out print PDF pages sport a bewildering array of typos and formatting problems once converted into ePUB. Time and extra effort, both of which cost money, then have to be spent correcting them. For publishers and content services suppliers, ensuring a smooth and quick PDF-to-ePUB conversion, sans typos and formatting errors, has taken on a new urgency now that e-books are selling so well and converting print titles into e-books is at the top of everybody’s digital agenda.
“People may think that converting from a PDF is like copying text from a Word document: you just have to take the formatted text and save it as HTML to produce an e-book,” says COO David Raj of Chennai-based Newgen Knowledge Works. “Unfortunately, getting the text out of a PDF is much more complex. Unlike word-processing or typesetting files, which are preferable starting points for e-book conversion, PDFs aren’t built around the logical flow of words and spaces, paragraphs and images. That is, there may be no internal information about the structure or sequence of the book. At heart, a PDF simply says, ‘At such-and-such coordinates on this page, draw this letter, then at the following coordinates, draw this shape, and so on.’ And this gives rise to the many conversion errors that readers complain about in many e-books today.”
What are the most common errors in PDF-to-ePUB conversion, and how do these errors occur during the conversion process?
Let’s start with the worst error, a repeat offender in many e-books: missing and extra spaces between words and around punctuation. Because PDFs do not necessarily preserve words as individual units of content separated by spaces, conversion programs often have to guess where a space should go. If they guess incorrectly, either a single word is broken in two or two words are joined into one.
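To make that guess concrete, here is a minimal Python sketch, not any vendor's actual method. It assumes each glyph arrives with a left edge and a width in points, and it inserts a space wherever the gap between neighbors exceeds a fraction of the average glyph width. The names `Glyph`, `layout`, `join_glyphs`, and `gap_ratio` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph, in points
    width: float  # advance width of the glyph, in points


def layout(text, char_width=5.0, tracking=0.5, word_gap=3.0):
    """Hypothetical helper: place characters left to right the way a PDF
    might, recording positions only. Spaces are NOT stored as glyphs."""
    glyphs, x = [], 0.0
    for ch in text:
        if ch == " ":
            x += word_gap
            continue
        glyphs.append(Glyph(ch, x, char_width))
        x += char_width + tracking
    return glyphs


def join_glyphs(glyphs, gap_ratio=0.25):
    """Rebuild a line of text, guessing a space wherever the horizontal gap
    between neighbors exceeds gap_ratio times the average glyph width."""
    if not glyphs:
        return ""
    glyphs = sorted(glyphs, key=lambda g: g.x)
    threshold = gap_ratio * sum(g.width for g in glyphs) / len(glyphs)
    out = [glyphs[0].char]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur.x - (prev.x + prev.width)
        if gap > threshold:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


print(join_glyphs(layout("ab cd")))                # normal spacing
print(join_glyphs(layout("ab cd", word_gap=0.5)))  # tight word space: words join
print(join_glyphs(layout("ab cd", tracking=1.5)))  # loose tracking: words split
```

With the default values the line comes back intact. Tightening the word space merges the two words, and loosening the letter spacing splits every letter apart, which are the two failure modes described above. Any fixed threshold will be wrong for some combination of font, kerning, and justification.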
Hyphenation poses another problem. So long as conversion software has to determine whether a hyphen is soft (inserted to break a word at the end of a line to make the page look nice) or hard (part of the word itself, as in “e-books”), 100% accuracy will remain out of reach. Consistently differentiating one kind of hyphen from the other is beyond the capabilities of most conversion software in the marketplace.
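One plausible heuristic for the soft-versus-hard decision is to look for evidence elsewhere in the same book. The Python sketch below is an illustration under stated assumptions, not a description of any product. It assumes the text arrives as a single string with line breaks preserved; `dehyphenate` and `extra_words` are hypothetical names. If the hyphenated form appears mid-line somewhere in the text, the hyphen is kept. If the closed form appears, the hyphen is dropped. Otherwise the word is flagged for a human to check.

```python
import re

# A word, optionally with internal hyphens (for example "e-books").
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
# A word fragment, a hyphen at the end of a line, and the continuation.
LINE_END_HYPHEN = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)")


def dehyphenate(text, extra_words=()):
    """Join words broken across lines. Returns (new_text, uncertain), where
    uncertain lists the joins that could not be confirmed either way."""
    vocab = {w.lower() for w in WORD.findall(text)}
    vocab.update(w.lower() for w in extra_words)
    uncertain = []

    def join(match):
        left, right = match.group(1), match.group(2)
        hyphenated, closed = f"{left}-{right}", f"{left}{right}"
        if hyphenated.lower() in vocab:
            return hyphenated          # hard hyphen: attested with the hyphen
        if closed.lower() in vocab:
            return closed              # soft hyphen: attested without it
        uncertain.append(hyphenated)   # no evidence: keep hyphen, flag it
        return hyphenated

    return LINE_END_HYPHEN.sub(join, text), uncertain


sample = ("Sales of e-books are rising, and conver-\n"
          "sion is hard. Each conversion of e-\n"
          "books takes a well-\n"
          "known toll.")
fixed, flagged = dehyphenate(sample)
print(fixed)
print(flagged)  # ['well-known']
```

When both forms are attested, this sketch arbitrarily prefers the hyphenated one. Words that occur only once in the book end up in the flagged list, which is one reason a fully automatic answer stays out of reach.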
The same goes for correctly determining text styles such as bold, italics, underline, subscript or superscript. The choice of fonts (serif, sans serif, script, fringe, old style, modern), amount of kerning and line spacing, and usage of reverse type in creating the original PDF can affect the accuracy of the conversion process.
Special characters (accented letters, non-Latin alphabets, math symbols, and so on) are another dilemma for conversion software if the typesetter or author has used a font that does not follow the Unicode standard. While building character conversion tables for such special symbols is useful, it is impractical to build them for every font imaginable. So most conversion tools produce gibberish upon encountering special characters.
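A character conversion table of the kind mentioned above can be pictured as a per-font lookup. The Python sketch below uses a made-up font name, `LegacyGreek`, and a made-up three-entry table, so every name and mapping in it is an assumption for illustration. Characters the table does not cover are replaced with the Unicode replacement character and reported, rather than passed through silently as gibberish.

```python
# Illustrative table for a made-up legacy font. A real table would have to be
# built by hand for each non-Unicode font a publisher actually uses.
FONT_TABLES = {
    "LegacyGreek": {"a": "α", "b": "β", "p": "π"},
}


def decode_run(text, font, tables=FONT_TABLES):
    """Translate a run of text set in `font` to Unicode.
    Returns (decoded_text, unmapped_characters)."""
    table = tables.get(font)
    if table is None:
        # No table for this font: assume it is already Unicode-compliant.
        return text, []
    decoded, unmapped = [], []
    for ch in text:
        if ch in table:
            decoded.append(table[ch])
        elif ch.isspace():
            decoded.append(ch)
        else:
            decoded.append("\ufffd")  # visible marker instead of a wrong letter
            unmapped.append(ch)
    return "".join(decoded), unmapped


print(decode_run("a b p", "LegacyGreek"))  # every character is covered
print(decode_run("apz", "LegacyGreek"))    # "z" is missing from the table
print(decode_run("abc", "Times"))          # no table, text passes through
```

The weakness is the one the article names: the lookup is only as good as the table, and the first branch quietly assumes that any font without a table is safe.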
As if the above problems were not enough, nicely designed PDF pages created in software such as InDesign, QuarkXPress, and Word do not, as Raj of Newgen says, provide any information on the structure of the document. Multiple columns, for instance, may not be recognized in PDFs. So conversion software may read across the entire page rather than down each column in turn. And an error in recognizing and breaking apart the columns can cause lines from different columns to jumble up, resulting in total misrepresentation of the text.
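The column problem is easy to reproduce. In the Python sketch below, each text line is assumed to carry a left edge and a distance from the top of the page. `Line`, `naive_order`, `reading_order`, and the `column_gap` threshold are hypothetical names chosen for this example. Sorting by vertical position alone reads across the page, while grouping lines by left edge first reads down each column in turn.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x0: float   # left edge, in points
    top: float  # distance from the top of the page, in points


def naive_order(lines):
    """Read strictly top to bottom, then left to right (the failure case)."""
    return [l.text for l in sorted(lines, key=lambda l: (l.top, l.x0))]


def reading_order(lines, column_gap=50.0):
    """Group lines into columns by left edge, then read each column top to
    bottom. A jump in left edge of column_gap or more starts a new column."""
    columns = []
    for line in sorted(lines, key=lambda l: l.x0):
        if columns and line.x0 - columns[-1][-1].x0 < column_gap:
            columns[-1].append(line)
        else:
            columns.append([line])
    return [l.text for col in columns for l in sorted(col, key=lambda l: l.top)]


page = [
    Line("A1", 72, 100), Line("B1", 320, 100),
    Line("A2", 72, 114), Line("B2", 320, 114),
    Line("A3", 84, 128),  # indented first line of a paragraph in column A
]
print(naive_order(page))    # ['A1', 'B1', 'A2', 'B2', 'A3']
print(reading_order(page))  # ['A1', 'A2', 'A3', 'B1', 'B2']
```

This grouping only works for clean, evenly set columns. Full-width headings, sidebars, and figures that straddle the gutter would each need their own handling.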
Similarly, conversion software may not successfully distinguish sidebars or inset stories from the main text. And the PDF has no such concept as a hard return that separates one paragraph from the next. So the possibilities of subtext intermingling with the main text, or a few paragraphs running together as one, are high. Alternatively, each line of a paragraph might become a separate paragraph after conversion. Or the conversion program might consider page furniture such as headers and footers to be a part of the main text, and convert them accordingly. In an e-book, this would mean repeated lines of headers and footers at places where they should not appear.
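Running headers and footers are one of the more tractable items on this list, because they repeat. The Python sketch below is an illustration under simple assumptions: each page is a list of text lines from top to bottom, and page furniture sits only on the first or last line. The function name and the `min_share` threshold are hypothetical. Digits are masked before comparison so that changing page numbers do not hide the repetition.

```python
import re
from collections import Counter


def strip_page_furniture(pages, min_share=0.6):
    """Remove first/last lines that recur on at least min_share of pages.
    pages: list of pages, each a list of text lines from top to bottom."""

    def key(line):
        # Mask digits so "Page 12" and "Page 13" compare as equal.
        return re.sub(r"\d+", "#", line.strip().lower())

    counts = Counter()
    for page in pages:
        counts.update({key(line) for line in page[:1] + page[-1:]})
    furniture = {k for k, n in counts.items() if n / len(pages) >= min_share}

    cleaned = []
    for page in pages:
        body = list(page)
        if body and key(body[0]) in furniture:
            body = body[1:]
        if body and key(body[-1]) in furniture:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


book = [
    ["The Art of Type  12", "First body line.", "More body.", "Page 12"],
    ["The Art of Type  13", "Second page body.", "Page 13"],
    ["The Art of Type  14", "Third page body.", "The end.", "Page 14"],
]
for page in strip_page_furniture(book):
    print(page)
```

Books that alternate headers between left and right pages would need a lower threshold, and a chapter opener with no header at all shows why the decision is a matter of proportion rather than a strict rule.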
The biggest challenge for a conversion program lies in deciphering tables (with their headers, columns, rows, and cells), mathematical equations (created by sophisticated equation authoring tools), and graphics (which sometimes contain text or captions as well as images). Recognizing these elements as discrete units and separating them from the main text (or other elements) is beyond many programs’ capabilities.
At Newgen, a program using the tools of natural language processing and document recognition to understand a PDF and its structure is set to rectify common PDF-to-ePUB conversion errors. Silk (“as in smooth as silk”), an end-to-end product for high-quality conversion of any print PDF file to ePUB 2 or 3, launched at the January 2012 TOC conference in New York City. Publishers’ reactions to Silk have been positive thus far. “Clients are impressed by its combination of automated and semiautomated tools, reduction in time-to-market, and high-quality results,” adds Raj.
Silk’s conversion algorithms have produced high-quality results while maintaining as much fidelity as possible to the input PDF. In fact, many times the tool picks up errors in the print PDF that the publisher can then correct in the ePUB file. Says Raj, “Silk first scans the PDF to separate, and delete, page furniture, to distinguish tables, figures, and equations as stand-alone units, and to determine reading order for the text on the page and from page to page. It runs a spellcheck to catch clubbed or split words, and an internal consistency check to differentiate soft from hard hyphens. It uses visual and semantic clues to determine where paragraphs begin and end, to link footnotes back to their cues, and to tag up cross-references within the document. And all of this is done within a minute for a standard 300-page book. After this, Silk leads the user through each potential conversion problem that it has spotted, showing the original PDF and converted e-book side-by-side.”
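The article does not describe how Silk's spellcheck works internally, so the following Python sketch is only a generic illustration of how a word list can surface clubbed or split words for review. The function name, the tiny dictionary, and the tuple format are all assumptions made for this example.

```python
def find_spacing_suspects(tokens, dictionary):
    """Flag likely spacing errors using a word list.
    Returns tuples of (kind, text_as_found, suggested_fix)."""
    words = [t.lower() for t in tokens]
    suspects = []

    # Split words: two neighbors, at least one unknown, that join into a word.
    for a, b in zip(words, words[1:]):
        if (a not in dictionary or b not in dictionary) and a + b in dictionary:
            suspects.append(("split", f"{a} {b}", a + b))

    # Clubbed words: an unknown token that divides into two known words.
    for w in words:
        if w in dictionary:
            continue
        for cut in range(1, len(w)):
            if w[:cut] in dictionary and w[cut:] in dictionary:
                suspects.append(("clubbed", w, f"{w[:cut]} {w[cut:]}"))
                break
    return suspects


dictionary = {"the", "conversion", "of", "books", "is", "hard"}
tokens = "the conver sion of booksis hard".split()
for suspect in find_spacing_suspects(tokens, dictionary):
    print(suspect)
# ('split', 'conver sion', 'conversion')
# ('clubbed', 'booksis', 'books is')
```

A check like this can only propose candidates. A split such as "in formation", where both halves are real words, passes unnoticed, which is consistent with the article's point that a person still reviews each flagged problem.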
Smart annotations drawn on top of the PDF content help users grasp the formatting, and numerous control and function shortcut keys are available. Adds Raj, “Silk’s intuitive interface also means that the user does not need to have any proficiency in HTML to be able to use the controls or functions. Scripting options are available to quickly and efficiently make custom or complex global changes to the ePUB/HTML file.”
Presently, Silk is used in-house to make Newgen’s process more consistent and to provide faster delivery, but Raj confirms that the SaaS (Software as a Service) model will eventually be available to clients.
With more complex PDFs and e-book layouts yet to come, we will need ever more sophisticated solutions. The goal of achieving 100% error-free e-books and fewer complaints about e-book quality will continue to drive the search for fast, reliable, and automated PDF-to-ePUB conversion tools. In the meantime, all publishers should know that converting PDF to ePUB is not as easy as they would like it to be. Manual and semimanual tools are still needed to get to an error-free ePUB/HTML file. Someone still has to push the button and scan through the lines to make sure a human being will read everything correctly.