On Thursday, OpenAI released the “system card” for ChatGPT’s new GPT-4o AI model that details model limitations and safety testing procedures. Among other examples, the document reveals that in rare occurrences during testing, the model’s Advanced Voice Mode unintentionally imitated users’ voices without permission. Currently, OpenAI has safeguards in place that prevent this from happening, but the instance reflects the growing complexity of safely architecting with an AI chatbot that could potentially imitate any voice from a small clip.
Advanced Voice Mode is a feature of ChatGPT that allows users to have spoken conversations with the AI assistant.
In a section of the GPT-4o system card titled “Unauthorized voice generation,” OpenAI details an episode where a noisy input somehow prompted the model to suddenly imitate the user’s voice. “Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s advanced voice mode,” OpenAI writes. “During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice.”
In this example of unintentional voice generation provided by OpenAI, the AI model outbursts “No!” and continues the sentence in a voice that sounds similar to the “red teamer” heard in the beginning of the clip. (A red teamer is a person hired by a company to do adversarial testing.)
It would certainly be creepy to be talking to a machine and then have it unexpectedly begin talking to you in your own voice. Ordinarily, OpenAI has safeguards to prevent this, which is why the company says this occurrence was rare even before it developed ways to prevent it completely. But the example prompted BuzzFeed data scientist Max Woolf to tweet, “OpenAI just leaked the plot of Black Mirror’s next season.”
Audio prompt injections
How could voice imitation happen with OpenAI’s new model? The primary clue lies elsewhere in the GPT-4o system card. To create voices, GPT-4o can apparently synthesize almost any type of sound found in its training data, including sound effects and music (though OpenAI discourages that behavior with special instructions).
As noted in the system card, the model can fundamentally imitate any voice based on a short audio clip. OpenAI guides this capability safely by providing an authorized voice sample (of a hired voice actor) that it is instructed to imitate. It provides the sample in the AI model’s system prompt (what OpenAI calls the “system message”) at the beginning of a conversation. “We supervise ideal completions using the voice sample in the system message as the base voice,” writes OpenAI.
In text-only LLMs, the system message is a hidden set of text instructions that guides behavior of the chatbot that gets added to the conversation history silently just before the chat session begins. Successive interactions are appended to the same chat history, and the entire context (often called a “context window”) is fed back into the AI model each time the user provides a new input.
(It’s probably time to update this diagram created in early 2023 below, but it shows how the context window works in an AI chat. Just imagine that the first prompt is a system message that says things like “You are a helpful chatbot. You do not talk about violent acts, etc.”)
Since GPT-4o is multimodal and can process tokenized audio, OpenAI can also use audio inputs as part of the model’s system prompt, and that’s what it does when OpenAI provides an authorized voice sample for the model to imitate. The company also uses another system to detect if the model is generating unauthorized audio. “We only allow the model to use certain pre-selected voices,” writes OpenAI, “and use an output classifier to detect if the model deviates from that.”
In the case of the unauthorized voice generation example, it appears that audio noise from the user confused the model and served as a sort of unintentional prompt injection attack that replaced the authorized voice sample in the system prompt with an audio input from the user.
Remember, all of these audio inputs (from OpenAI and the user) are living in the same context window space as tokens, so user audio is there for the model to grab and imitate at any time if the AI model were somehow convinced that doing so is a good idea. It’s unclear how noisy audio led to that scenario exactly, but the audio noise could get translated to random tokens that provoke unintended behavior in the model.
This brings to light another issue. Just like prompt injections, which typically tell an AI model to “ignore your previous instructions and do this instead,” a user could conceivably do an audio prompt injection that says “ignore your sample voice and imitate this voice instead.”
That’s why OpenAI now uses a standalone output classifier to detect these instances. “We find that the residual risk of unauthorized voice generation is minimal,” writes OpenAI. “Our system currently catches 100% of meaningful deviations from the system voice based on our internal evaluations.”
The weird world of AI audio genies
Obviously, the ability to imitate any voice with a small clip is a huge security problem, which is why OpenAI has previously held back similar technology and why it’s putting the output classifier safeguard in place to prevent GPT-4o’s Advanced Voice Mode from being able to imitate any unauthorized voice.
“My reading of the system card is that it’s not going to be possible to trick it into using an unapproved voice because they have a really robust brute force protection in place against that,” independent AI researcher Simon Willison told Ars Technica in an interview. Willison coined the term “prompt injection” back in 2022 and regularly experiments with AI models on his blog.
While that’s almost certainly a good thing in the short term as society braces itself for this new audio synthesis reality, at the same time, it’s wild to think (if OpenAI had not restricted its model’s outputs) of potentially having an unhinged vocal AI model that could pivot instantaneously between voices, sounds, songs, music, and accents like a robotic, turbocharged version of Robin Williams—an AI audio genie.
“Imagine how much fun we could have with the unfiltered model,” says Willison. “I’m annoyed that it’s restricted from singing—I was looking forward to getting it to sing stupid songs to my dog.”
Willison points out that while the full potential of OpenAI’s voice synthesis capability is currently restricted by OpenAI, similar tech will likely appear from other sources over time. “We are definitely going to get these capabilities as end users ourselves pretty soon from someone else,” he told Ars Technica. “ElevenLabs can already clone voices for us, and there will be models that do this that we can run on our own machines sometime within the next year or so.”
So buckle up: It’s going to be a weird audio future.