There’s no question that Robert Mueller’s investigation of Russian interference with the 2016 presidential election and the subsequent release of the Mueller report is the most powerful political story of the decade. It’s been holding us spellbound for over two years now, and with Mueller himself set to appear before Congress later this month, the drama will continue. But many people don’t realize that it’s a fascinating publishing story as well.
Do you remember all of the delays to the release of the report? All of the speculation? What revelations would it contain? How long would the report be? On April 18, it landed with a heavy thud—448 pages of dense text, footnotes, citations, and redactions.
As we write this, the Scribner edition of the report (with, per the publisher, “exclusive analysis by the Pulitzer Prize–winning staff of the Washington Post”) remains at #1 on the New York Times paperback bestseller list, 11 weeks after its release. NPD BookScan data shows some 250,000 print copies sold of that one edition, which competes with the also-successful trade-published versions from Melville House and Skyhorse. There are eight large-print editions. Several illustrated versions are in the works, including a graphic novel rendition by Barbara Slate. The full range of self-publishing’s power is on display on Amazon, where you can buy any of several dozen e-book versions, for 99¢ or a few dollars each.
All of this despite the report being available as a free download on the U.S. Department of Justice website. But what do you get for free? A 140 MB PDF file.
Is it somehow poetic that the blind are unable to read the Mueller report? Or, more accurately, the print disabled. A day after its April 18 release, Duff Johnson of the PDF Association wrote a scathing critique of the report—not of its content but of the format of the digital file. The 448-page PDF appeared to be just a scan of a printout of the version handed over by Mueller’s team. The report, Johnson wrote, was not searchable, the file contained no text, and it was not tagged. “Unfortunately,” he wrote, “the image-based PDF the Department of Justice delivered is the least easy-to-use of any option they could have chosen.”
Johnson tells us, “I figured it would be hideous in some ways; I just didn’t expect all the ways they would make it hideous.” As he points out, the DOJ (along with most government agencies) has an unambiguous policy of ensuring that public documents comply with Section 508 regulations—which include making them accessible to users with print disabilities. This file wasn’t even close.
On April 22, the Justice Department quietly posted a second version of the report, replacing the first. This new file had been run through OCR (optical character recognition, software that converts images to text) in an attempt to make the text at least searchable and readable by text-to-speech software. Some tags and bookmarks were added. But, once again, it failed the accessibility test, and there are thousands of errors in the capture, which is a problem for everybody.
Both the New York Times and the Washington Post made a commitment to their readers to fully explore the report. Faced with the redaction-filled scan (a Wall Street Journal analysis showed that 2,050 of the 16,500 lines in the report were redacted, over 12%), each publication pulled together its own team to quickly dissect the PDF, correct the OCR errors, and release a full-text version via its website.
We’ve since learned from Tiffany Fehr, an assistant editor for interactive news at the New York Times, that the Times had a team of 22 newsroom volunteers working to deliver a full-text version less than 24 hours after the report’s release. The Times used custom in-house document-processing software to OCR the file, getting a pretty good result out of the gate. The subsequent by-hand cleanup centered around recreating line breaks and then detecting patterns to separate footnote text from body text. The outcome was a file that was accurate but still not properly accessible.
Documents typically use visual cues to convey information. The redactions in the Mueller report are treated visually, and so they convey little or nothing to print-disabled users. What matters most is the structure of the document: assistive technology relies on HTML heading tags for navigation; just making headings bold or all caps doesn’t cut it. Image descriptions need to be included, and there is special markup called ARIA (Accessible Rich Internet Applications) that enables assistive technology to provide what formatting conveys to sighted users.
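To make the point about structure concrete, here is a minimal sketch in Python of the kind of check an accessibility audit might run: it lists the real heading tags a screen reader could navigate by and counts images with no alt attribute. The class name, function name, and sample markup are our own illustrations, not taken from any of the tools used on the report.

```python
from html.parser import HTMLParser


class AccessibilityAudit(HTMLParser):
    """Collect real heading tags and count images with no alt attribute."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []           # (level, text) pairs in document order
        self.images_missing_alt = 0
        self._current_level = None   # level of the heading being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current_level = int(tag[1])
            self._buffer = []
        elif tag == "img" and "alt" not in dict(attrs):
            # An empty alt="" is allowed for decorative images,
            # so only a missing attribute is counted here.
            self.images_missing_alt += 1

    def handle_data(self, data):
        if self._current_level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._current_level is not None:
            text = "".join(self._buffer).strip()
            self.headings.append((self._current_level, text))
            self._current_level = None


def audit(html):
    """Return (headings, images_missing_alt) for an HTML string."""
    parser = AccessibilityAudit()
    parser.feed(html)
    parser.close()
    return parser.headings, parser.images_missing_alt


# Hypothetical sample: the bold "INTRODUCTION" looks like a heading
# to a sighted reader but is invisible to heading navigation.
sample = """
<h1>Volume I</h1>
<p><b>INTRODUCTION</b></p>
<h2>Executive Summary</h2>
<img src="chart.png">
<img src="seal.png" alt="Department of Justice seal">
"""
headings, missing = audit(sample)
```

In the sample, only the two tagged headings are found; the bold line is skipped, which is exactly what a screen reader user would experience.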
Having read Johnson’s critique of the many problems specific to the Mueller report, Bill and I started to ponder what it would take to create a fully accessible edition of the report. Bill brought into the discussion Richard Orme, CEO of the DAISY Consortium, the leading organization for promoting reading accessibility worldwide. He provided an in-depth analysis of how to tackle the redaction issues, giving us confidence that a conforming solution was possible.
Over at the New York Times, Fehr tells us, the team spotted numerous link opportunities where they thought a reader might want to click through to read the source material. The document contains 2,390 footnotes (including 3,355 specific references), spread across two volumes and four appendices. In the PDF released by the DOJ, only 14 of these references feature hyperlinked URLs leading to live web pages.
Initially the Times focused on all the cited news reports, but the effort soon grew to include links to government documents, press releases, congressional testimonies, U.S. law, and more. Fehr says, “Many of those sources are far more readable than a reader might expect, so it seemed like helpful context to add to the report.”
The Digital Public Library of America (DPLA) is an all-digital library that aggregates metadata and thumbnails for millions of photographs, manuscripts, books, sounds, and moving images from libraries, archives, and museums across the U.S. Its Open Bookshelf is a digital library collection of popular e-books free to download, and it created its own ePub version of the report.
Getting the Ball Rolling
We had begun thinking that DPLA would be an ideal partner for our effort—because a fully accessible version should also be a freely available version. We intended to contribute our efforts without cost, and so we approached Micah May, an e-book consultant at DPLA, about making their ePub accessible. DPLA too had worked from an OCR pass of the text, and then engaged its vendor, Digital Divide Data, to closely proofread and create a text-accurate file. May agreed that adding accessibility features would bring its e-book up another level.
Meanwhile, Mark Graham, director of the Wayback Machine at the Internet Archive, was thinking about links. He says that upon seeing the Times’ effort, he reached out to its team, which donated its initial list of links. He knew there were hundreds more hyperlink options in the document and built an internal group of interns at the archive to parse the info.
“We did it the old-fashioned way,” Graham says. “We went through every footnote.”
The result is some 750 hyperlinked citations, including more than 100 links to justice.gov, 80 links to Twitter, and 50 links to C-Span—links archived on the Wayback Machine to avoid link rot and content drift. Graham’s next challenge was to find a debugged e-book file to enrich with this treasure of connections.
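For readers curious about what an archived link looks like, a Wayback Machine address is simply the original URL prefixed with the archive's host and a 14-digit capture timestamp. The sketch below builds one in Python; the function name and the capture time are illustrative assumptions, not actual captures from the project.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, captured_at):
    """Build a Wayback Machine URL for a page as captured at a given time."""
    # Wayback timestamps use the form YYYYMMDDhhmmss.
    stamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


# Hypothetical capture time, chosen only for illustration.
link = wayback_url("https://www.justice.gov/", datetime(2019, 4, 18, 12, 0, 0))
```

Because the link points at a dated snapshot rather than the live page, it keeps working, and keeps showing the same content, even if the original page later moves or changes.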
Graham also reached out to May at DPLA. They started to discuss a version of DPLA’s ePub that would include the archive’s stable links. We joined that discussion, as well.
At the same time, Bill began a conversation with Walter Walker, president of CodeMantra, a Boston-based publishing services company. Its Accessibility Delivery Hub software automates the creation of accessible documents (with some human intervention as needed). Walker was game for putting the platform through its paces on the Mueller report as a public service, waiving the usual fees.
Now we had a strong team. We continued the collaboration with Orme of DAISY and with Johnson of the PDF Association to come up with a way of handling the redactions that would work visually in the ePub for sighted users while also conveying the same information—the reason for each redaction and its extent—to print-disabled users. This required extensive manual work that could be done well and economically by Digital Divide Data. Then CodeMantra did the processing in its system, to add all the necessary accessibility features.
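The exact markup used in the finished ePub is not reproduced here, but the general idea can be sketched. The Python function below is our own hypothetical illustration, not the team's code: it emits a span that shows the redaction reason visually and carries the same reason, plus an approximate extent, in an ARIA label for assistive technology. The class name, label wording, and line count are assumptions.

```python
from html import escape


def redaction_span(reason, lines):
    """Return HTML for a redaction that states its reason and extent."""
    unit = "line" if lines == 1 else "lines"
    label = f"Redacted, {reason}, approximately {lines} {unit}"
    # role="img" with aria-label asks assistive technology to announce
    # the label in place of the visual treatment.
    return (
        f'<span class="redaction" role="img" '
        f'aria-label="{escape(label, quote=True)}">'
        f"{escape(reason)}</span>"
    )


# Hypothetical example using one of the report's redaction categories.
markup = redaction_span("Harm to Ongoing Matter", 3)
```

A stylesheet can then render the span as a black bar with the reason overlaid, while a screen reader announces the label.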
DPLA’s ePub now conforms to WCAG 2.1 AA and the ePub Accessibility 1.0 specification, which means that it meets the requirements of Section 508.
This may sound like a lot of effort—and it was. From the start, the problem was that the original report was issued in such an inaccessible format. It needn’t have been. One of the most important concepts in accessibility today is the idea of being “born accessible”: accessibility should just be built into publishing workflows in the first place. That means that the people who need an accessible version don’t need a special version; they should be able to obtain the same ePub as everybody else—at the same time and at the same price.
So was it worth the effort? Of course! For something so critically important to the body politic, it is unconscionable for the report to be inaccessible to so many citizens. Thanks to the folks pulling together to make this happen, the Mueller report can be read by everybody—and for free. The new accessible version of the e-book can be downloaded from the DPLA website at https://pro.dp.la/ebooks/mueller-report.
Thad McIlroy is an electronic publishing consultant at the Future of Publishing. Bill Kasdorf is principal at Kasdorf & Associates. They are founding partners of Publishing Technology Partners.