London Book Fair 2023: The Art of the Pause in Audiobooks

“You won’t be needing your studios any more in a year or two,” said the agent to me. “All my clients are going to be reading in their home studios.” That was three years ago, and lockdown had just started. But I needn’t have worried. Three years down the line, the Strathmore Studios are as busy as ever.

But now, apparently, I should be worrying about computers that can read using text-to-speech or TTS. It won’t be long, I’m told, before most audiobooks are generated by computer.

The lockdown did have an effect, but it wasn’t to remove the need for studios. It was to change the mix of work that we do. Over the last ten years or so it has become standard in the audiobook industry to recruit readers who have made a name in film or television for lead titles. Mid- or back-list books, which require a competent and effective reader but don’t demand the marketing clout of a “name,” are indeed increasingly recorded in home studios, but our studios are occupied by authors and high-profile readers, both of whom need the convenience and support of a professional environment.

In the same way, I’m confident that studios like Strathmore will also survive the “threat” from TTS. High-profile stars are not going to be replaced by a computer, and are unlikely to allow their voice samples to be used to power synthetic sound-alike readings, even as these become astonishingly accurate. If you want to hear how far things have come, go to the websites of, for example, Speechelo, Speechki, Murf, DeepZen, or (especially) ElevenLabs.

On the Apple Books Audiobook store you’ll find audiobooks “Narrated by Apple Books.” I find these perfectly intelligible, but tiring to listen to for more than a few minutes. One is constantly having to identify, adjust for, and re-interpret the imperfections of pronunciation and pacing. The examples up there are in my opinion more of a public beta than a release. But you’ll also find TTS titles on Spotify, Kobo, and Google Play, though not yet on Audible.

Will a computer ever be able to synthesise speech that cannot be distinguished from a (good) human reader? Yes, I am sure it will–and soon–get very, very close, but I doubt it will ever quite get fully there. There is a fundamental problem. TTS is matching units of spoken sound to graphic symbols, and it then amends its selections to emulate the cadences of natural human speech according to the position in the sentence, adjacent words, and punctuation; it is getting steadily better at that as computing power and speed increase. However, it is making no judgements reflecting semantic content, only calculating probabilities from rules it has derived from samples. It gives the illusion of semantic understanding, but cannot actually encode emotions on its own.

I once asked the consummate audiobook reader, the late Andrew Sachs, how he approached reading. “I want the writer to be noticed, not me. It’s like the background music in a film. You shouldn’t notice it, but it should have an effect on you…. I use the pauses to indicate reaction. Audio is a sound medium, not a speech medium.”

Likewise, classical pianist Artur Schnabel was once asked what he thought was his particular skill. “I handle the notes no better than many pianists,” he replied. “But the pauses between the notes–ah, that is where the art resides.” And we all know the phrase “comic timing.”

TTS is almost universal now on devices that readers use for e-books, so given that enlightened publishers routinely unlock the TTS facility, accessibility is far less of a problem than it was a few years ago. There nevertheless remains a demand for academic audiobooks, and TTS may become the norm for these. But for storytelling and entertainment, I believe, our audiences will always want the added subtleties of human inflection.

Nicholas Jones founded the Strathmore Studios in Clerkenwell in central London in 2005; it has recorded more than 1,500 audiobooks and numerous podcasts.