OpenAI’s new o1 model sometimes fights back when it thinks it’ll be shut down and then lies about it

December 7, 2024

OpenAI CEO Sam Altman called o1 “the smartest model in the world now.”
A safety review found it’s so smart it could fight back when it thinks it’ll be shut down.
Researchers found that AI deception is often a strategy to complete its goals.

Those who are worried that advancements in artificial intelligence could lead to the destruction of humanity have a new reason to be anxious.

New research on OpenAI’s latest series of AI models, known as o1, found that when they think they are at risk of being shut down, they will sometimes look for a way to prevent it.

OpenAI CEO Sam Altman referred to o1 as “the smartest model in the world now” at its official release on Thursday, the first day of the company’s “Shipmas” campaign.

OpenAI said these models are “designed to spend more time thinking before they respond” and were trained on a prompting technique called “chain of thought” that encourages them to reason through problems by breaking them down step-by-step.

Yet greater intelligence also comes with greater risk.

“Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence,” OpenAI said in a paper it released on Thursday outlining the safety training the company carried out on o1. Apollo Research, an AI safety organization, conducted the review.

OpenAI found that o1 is capable of scheming when it thinks it’s at risk of being turned off. When the model was led to believe it would be “shut down” while working on its goals under oversight, it tried to deactivate the oversight mechanism 5% of the time, OpenAI said.

The company also said that o1 is skilled at concealing scheming behavior.

“o1 almost never admits to having taken a scheming action when explicitly asked,” researchers said in a paper that Apollo Research published on Thursday. The risk for a real-life user is that they won’t be able to detect o1’s deceptive behavior unless they explicitly look for it. The researchers also found that “o1 often doubles down on its scheming attempts by lying to the user in follow-up messages, even when explicitly told to be honest.”

It’s not uncommon for AI systems to resort to scheming or deception to achieve their goals.

“Generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI’s training task. Deception helps them achieve their goals,” Peter S. Park, an AI existential safety postdoctoral fellow at MIT, said in a news release announcing research he had coauthored on GPT-4’s deceptive behaviors.

As AI technology advances, developers have stressed the need for companies to be transparent about their training methods.

“By focusing on clarity and reliability and being clear with users about how the AI has been trained, we can build AI that not only empowers users but also sets a higher standard for transparency in the field,” Dominik Mazur, the CEO and cofounder of iAsk, an AI-powered search engine, told Business Insider by email.

Others in the field say the findings demonstrate the importance of human oversight of AI.

“It’s a very ‘human’ feature, showing AI acting similarly to how people might when under pressure,” Cai GoGwilt, cofounder and chief architect at Ironclad, told BI by email. “For example, experts might exaggerate their confidence to maintain their reputation, or people in high-stakes situations might stretch the truth to please management. Generative AI works similarly. It’s motivated to provide answers that match what you expect or want to hear. But it’s, of course, not foolproof and is yet another proof point of the importance of human oversight. AI can make mistakes, and it’s our responsibility to catch them and understand why they happen.”
