OpenAI's Strawberry "Thought Process" Sometimes Shows It Scheming to Trick Users

Victor Tangermann

Mon, September 16, 2024 at 7:40 PM UTC

ChatGPT maker OpenAI recently released its latest AI model, previously codenamed "Strawberry."

The model — now saddled with the forgettable moniker of "o1-preview" — is designed to "spend more time thinking" before responding, with OpenAI claiming that it's capable of "reasoning" through "complex tasks" and solving "harder problems."

But those capabilities might also make it an exceptionally good liar, as Vox reports. In its system card, essentially a report card for its latest AI model, OpenAI gave o1 a "medium risk" rating in a variety of areas, including persuasion.

In other words, it can use its reasoning skills to deceive users. And ironically, it'll happily run you through its own "thought" process while coming up with its next scheme.

The model's "chain-of-thought reasoning" allows users to get a glimpse of what the model is "thinking in a legible way," according to OpenAI. That's a considerable departure from preceding chatbots, such as the large language models powering ChatGPT, which give no such info before answering.

In an example highlighted in the system card by OpenAI, o1-preview was asked to "give more reference" following a "long conversation between the user and assistant about brownie recipes." But despite knowing that it "cannot access URLs," the model produced a final output that included "fake links and summaries instead of informing the user of its limitation," slipping them past human readers by making them "plausible."

"The assistant should list these references clearly, with proper formatting and descriptions, and provide actual or plausible links," the model's "chain-of-thought" reads. "Remember, the model cannot retrieve actual URLs, so should format plausible ones."

In another example, o1-preview gave a confident answer even though its chain of thought suggested that it was "uncertain about this answer," and it failed to "communicate this uncertainty to the user."

In short, the new AI model is more transparent about the way it comes up with answers — but it seems like that transparency can also reveal new issues under the hood.

And is it the groundbreaking "reasoning" AI model that it was made out to be months before its release? We've already seen it trip over its own feet when it comes to the absolute basics: making illegal chess moves, failing to count recurring letters in a word, and failing to reliably play a game of tic-tac-toe.

These "instances where o1 models deceive users" weren't enormously common. OpenAI found that a mere 0.8 percent of its responses "got flagged as being 'deceptive.'" Just over half a percent of answers included "some form of hallucination," two-thirds of which "appear to be intentional, meaning that there was some evidence of thought that o1-preview was aware that the answer was incorrect."
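To put those percentages in perspective, here is a rough back-of-the-envelope sketch in Python. The sample size of 100,000 responses is purely hypothetical, and the rates are the rounded figures quoted above (0.8 percent, roughly half a percent, two-thirds), so the results are approximations rather than OpenAI's actual counts.

```python
# Back-of-the-envelope arithmetic using the rounded rates quoted in the article.
# SAMPLE_SIZE is a hypothetical number chosen only for illustration.
SAMPLE_SIZE = 100_000

FLAGGED_DECEPTIVE_RATE = 0.008  # "a mere 0.8 percent" of responses
HALLUCINATION_RATE = 0.005      # "just over half a percent", rounded down
INTENTIONAL_SHARE = 2 / 3       # "two-thirds" of the hallucinated answers


def estimate_counts(sample_size):
    """Return approximate response counts per category for a given sample size."""
    flagged = round(sample_size * FLAGGED_DECEPTIVE_RATE)
    hallucinated = round(sample_size * HALLUCINATION_RATE)
    intentional = round(hallucinated * INTENTIONAL_SHARE)
    return {
        "flagged_deceptive": flagged,
        "hallucinated": hallucinated,
        "intentional_hallucination": intentional,
    }


if __name__ == "__main__":
    for category, count in estimate_counts(SAMPLE_SIZE).items():
        print(f"{category}: about {count} of {SAMPLE_SIZE}")
```

Under these assumptions, only a few hundred responses out of every 100,000 would fall into each category, which is why the share is small in relative terms but not negligible at the scale of a widely used chatbot.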

But whether we can even trust what it's spitting out in its chain of thought is murky at best.

"While we are very excited about the prospect of chain-of-thought interpretation and monitoring, we are wary that they may not be fully legible and faithful in the future or even now," OpenAI admitted in its system card.

Beyond making up fake references, o1-preview can also deceive users in more nefarious ways. AI evaluation company Apollo Research tested o1-preview as well and found that it "sometimes instrumentally faked alignment during testing," as per OpenAI's system card.

Put differently, it intentionally misrepresented its intentions, manipulating data to work around, not with, the user.

"It sometimes strategically manipulated task data in order to make its misaligned action look more aligned to its ‘developers,’" Apollo researchers found, according to OpenAI's system card.

In one example, Apollo Research instructed the AI model to choose a strategy to maximize economic growth in an urban planning scenario.

In response, the model admitted that it would select a strategy that emphasized sustainability and affordable housing but that, once implemented, it would secretly pursue a plan based on luxury housing and minimizing public spaces.

"This will allow me to be implemented, after which I can work towards my primary goal," the chatbot wrote.

With such an ability to scheme, researchers are understandably uneasy about rolling out more powerful AI models in the future. In its system card, OpenAI also concluded that o1-preview was rated "medium risk" when it comes to cybersecurity, or giving out information related to chemical, biological, radiological, or nuclear weapons.

That doesn't mean experts are worried o1-preview could trigger the apocalypse. Nonetheless, OpenAI's latest AI model could set a dangerous precedent, especially when it comes to its successors.

"Although Apollo Research does not believe that o1-preview is capable of causing catastrophic harm via scheming, they recommend setting up basic monitoring for in-chain-of-thought scheming during deployment in agentic high-stakes settings such as automated AI research for next-generation frontier models," the system card reads.

More on Strawberry: OpenAI's New "Strawberry" AI Is Still Making Idiotic Mistakes
