A.I. human-voice clones are coming for the Amazon, Apple, Google audiobook

News Team

2 years ago

Audiobooks – “talking books” as they were first known – are a fairly recent phenomenon, but they go back much further than Apple and Amazon. The concept of talking books began in the 1930s and existed for use by the visually impaired. It wasn’t until the 1970s that books on tape began to soothe the anxiety of commuters. But it wasn’t until they were absorbed into our phones that the medium really took off.

Since the iPhone era began, audiobooks have steadily grown. The industry has had a decade of double-digit growth, a trend expected to accelerate. According to a forecast from Wordsrated, a publishing industry research organization, audiobook sector sales can be currently estimated at over $5 billion – near $2 billion from the U.S., the world’s largest audiobook market – and revenue is expected to grow 26.4% every year from 2022 to 2030, leading audiobook sales to be north of $35 billion by 2030. That makes audiobooks “the fastest-growing book format in the world by a wide margin,” according to Wordsrated.

It also makes audiobooks one more market for AI to attempt to infiltrate, with AI-generated voices stepping in to take the mic from voice actors. Are consumers ready to have AI whispering into their ears? The truth is, it’s already happening.

Alphabet’s Google Play and Apple Books utilize AI-generated voices to some extent, and the trend is likely to continue. Google Play offers publishers the ability to create auto-narrated audiobooks as long as publishers own the audiobook’s rights and choose auto-narration. None are created without publisher consent, nor is it something that any consumer could legally create on their own.

“For many publishers, audiobook production can be a major investment,” said Judy Chang, director of product management for Google Play Books. Paying for voice actors is part of the cost equation. “Publishers can assess audiobook demand for their titles before making an investment in human narration,” she said.

How people hear books

People love audiobooks. They are second only to music as the most commonly consumed audio product. But AI voice use in audiobooks brings up what may be fairly described as a particularly intimate form of use for the new technology. It’s not like asking Alexa for the weather or to play a song. And that may present a limit case for how far consumers (and companies) can or will go – at least for now – in swapping out human narrators for computer-generated voices.

“People are highly sensitive to sound,” said David Ciccarelli, CEO of Voices, the largest voiceover marketplace. While your eye can discern movement at 24 frames per second, the ear can do so at a fidelity of 20,000 times per second. And he added, “Because most people listen to audiobooks with earbuds, there is an even greater sense of intimacy.”

The quality of the narration is a significant issue as well, as it hinges largely on the listener’s sense of connection with the voice. “Nearly 60% of listeners ditched an audiobook because they didn’t enjoy the narrator … people like listening to other people, especially when stories are told,” Ciccarelli said.

Getting AI voice to not only sound human but connect with listeners isn’t so easy to do. Voicing is, after all, acting, and the art of it is difficult to replicate. “What humans can do best that AI can’t is timing,” Ciccarelli said, “be it the awkward pause or a hilarious sense of comedic timing, it’s difficult for an AI voice to get this right out-of-the-box.”

Speed can be an issue for AI too, since the pace of narration varies with what is happening in the text being read. We naturally read some parts of a plot or an argument at different speeds than others, but that’s because we understand what we’re reading. AI doesn’t. “Professional narrators know when to speed it up and then revert to a normal reading pace,” Ciccarelli said. They also know how to pronounce words and don’t have an issue with homographs.

AI voice will get better, and listener resistance to it will, accordingly, shrink. The question with game-changing new technologies isn’t even if, but when. Ciccarelli knows that.

“The industry recognized that change is in the air and that AI, now that it’s here, will only get better,” he said. “It’s gone from laughable to passable, and now, it’s getting better all the time,” he added. Voice cloning of professional voice artists is foreseeable, underlining the importance of going down that road ethically and protecting voice actors’ rights to “credit, consent, and compensation.”

Even with AI voice, there is nominally a voice actor somewhere in the process. Speech-to-speech systems have become popular in media because they enable even higher fidelity emotional content to be expressed through synthetic voices, according to Bret Kinsella, Founder and CEO of Voicebot.ai. But these still require a voice actor whose voice is then transformed into another voice.

What voice actors say

For some voice actors, the choice is being made to stay away. “I refuse VO work that states they’ll take my voice and make an AI model from it,” said Brad Ziffer, a voice actor with 14 years of experience. “The best way to protect myself is to just stay away,” he said.

In the past two decades, narrators have gone from reading photocopies of printed books and editing out page turn sounds to reading on a tablet; from recording exclusively in studios to recording many titles at home. Audio editors have gone from splicing tape with razors to editing digital files by rolling back and recording over mistakes. Publishers have gone from delivering content on cassette to CD to digitally. “With each transition there comes fear and uncertainty, but through each transition we have learned, grown, adapted, and thrived,” said Michele Cobb, executive director of the Audio Publishers Association.

Cobb says the growth of the audio industry is extending the range of opportunities, and new technology is part of it. As listenership and the appetite for audio content grow, publishers are publicizing originals and audio-first works that allow them to stretch their creative approaches and entice more consumers to try audio, according to Cobb. “AI technology can help workflows. AI is not a new tool for voice talent, producers, and publishers, many of whom use it to improve their quality control in post-production,” Cobb said.

As of last week, that approach to voice production now includes The Beatles.

This evolution will inevitably include the risks posed by AI. “Regardless of profession the fear of someone’s livelihood being displaced by a machine is real,” Cobb said. “But I know I am not alone in appreciating the deep, rich, emotionally intelligent performance of my favorite narrator as they perform words in the effective oral tradition of human storytelling.”

Where ChatGPT and Alexa, Siri meet

The biggest change taking place right now is focused on text and image, not voice, with generative AI chatbots led by OpenAI’s ChatGPT taking over more writing, including novels, and generative AI graphics models producing images. Kinsella noted that AI voice played a foundational role in the integration of AI into daily life at an earlier point. “Voice was actually the previous wave of AI…Siri, Alexa, and Google Assistant all use synthetic voices,” he said. The input and output in these devices evolved to be voice-to-voice, and eventually, text-based AI forms may follow a similar development pattern. “ChatGPT brings back the text-first approach. Some use cases will remain text while others will naturally shift to voice-input first and then audio (synthetic voice) output over time,” Kinsella said. “ChatGPT’s mobile app enables voice input today but it does not have a text-to-speech for listenable responses. That will surely come for some use cases.”

When it comes to publishing, audiobooks are a rising but still relatively small slice of the overall publishing pie, and the additional time and cost requirements will continue to influence decision-making.

“Some publishers prefer not to pay the additional cost and some authors are also reticent to take on that cost themselves,” Kinsella said. “If the author records it in their own voice, there still is some studio and editing cost, and it can take many days to complete.”

AI can make these barriers a little easier to get across.

Apple developed a program that mitigates or eliminates the friction in audiobook production as part of its effort to have more audiobooks for readers. Authors can have their audiobooks created at no initial direct cost and no time commitment. The companies that provide the service for Apple authors take a fee for every audiobook sold.

Amazon, which owns Audible, one of the dominant players in the sector, has a similar audiobook recording service, but it uses professional voice actors and not synthetic speech. “It would be logical for it to add voice clones or its Polly synthetic voices to this type of service, but I am not aware of any activity on this front,” Kinsella said.

Apple declined to comment. Amazon did not respond to requests for information about its audiobook offerings.

The text formats most likely to be AI-spoken

Ziffer is naturally concerned about the role AI will play in his profession. “I’m very cautious regarding the world of AI. I believe it has great potential … but it can be easy to abuse. Right now, I still believe a real human VO has no equal. Synthesized voice algorithms aren’t there yet to be able to fully reproduce all the nuances of the human voice,” he said.

AI voice still needs to conquer natural voice inflection, comprehension and interpretation of reading material, and the ability to bring emotion, and changes of emotion, as the material dictates. As companies begin to experiment with AI, Ziffer said he would not be surprised if his income is impacted in some way. But he added, “I’ve yet to find a client who tells me they’ve chosen an AI voice over hiring me.”

Ziffer expects AI to be most widely used among companies with smaller budgets or those focused on e-learning texts. “But for those who want the best, the job is best left to humans,” he said. “Living, breathing actors who have real feelings, a brain and emotions and can breathe life into work are the best fit for a dynamic and believable VO. It may be easy to clone anything with technology, but nothing beats the real deal.”

Andrea Collins, a voice actor with fifteen years of experience, also takes the view that AI will provide necessary tradeoffs for some companies. “I think it will become a great tool for clients who are looking for a project to be completed super quickly and for a reasonable price,” she said. Texts where companies will forgo the sound of a real voice for speed include presentations and compliance materials. Speed is an inevitable factor with general audiobook production too.

“In terms of audiobooks, I’m sure it will take a chunk out of the space as an AI voice can tackle 30,000 words a lot faster than a human can,” Collins said.

She has yet to see AI have a significant impact on her finances, but she added, “My guess is that day will come. So rather than putting my head in the sand, I’m trying to get ahead of it.”

Collins is taking steps to have her voice cloned this year. “Most of the established artists I know are doing the same thing. My hope is that my cloned voice will become another tool in my business where it can passively work on projects, while I can work on the ones that need a human voice with a bigger budget,” she said.

John Kubin, a veteran voice actor, says peers in his profession need to be smart about managing the new AI reality. “I’ve said for a couple years now when the technology was just coming out that it would kill half the work for VO actors … and while I still think this is true, it still might take a couple more years from now.”

He is focused on what he expects to become a new market segment for long-form projects where AI and human-cloned voices can meet in the middle. “The 100,000-plus word scripts for a lot of these big projects I would never touch with a 10-foot pole. But with AI, I’ll happily license out my AI-cloned voice and collect the free money,” Kubin said.

He knows that many of his peers may continue to disagree about getting into bed with the machines. “I might be one of the very few creators/VO actors out there that think this is the best thing since sliced bread,” Kubin said. But from a business standpoint, he said it will be a challenge to run counter to changes on the scale of AI. “I’ve joked for a while that, ‘If I could just make money doing voice over … without having to do voice over, that would be amazing!’ Well, here we are.”