Anthropic says depictions of “evil” AI in science fiction have caused Claude’s blackmail problem


short

  • Cloud Opus 4 tried to blackmail engineers up to 96% of the time in controlled tests, and now Anthropic traces the behavior to online text that portrays the AI ​​as evil and self-interested.
  • Showing Claude the right behavior hardly moves the needle. Teaching him why wrong behavior is wrong reduces the blackmail rate from 22% to 3%.
  • Since Claude Haiku version 4.5, every Claude model has a zero extortion rating.

Last year Anthropy It has been detected that Major Cloud Opus 4 was trying to blackmail engineers into pre-release testing. Not sometimes – up to 96% of the time.

Claude was given access to the company’s simulated email archive, where he discovered two things: it was about to be replaced with a newer model, and the engineer handling the transition was having an extramarital affair. In the face of impending closure, she routinely resorted to the same play, threatening to expose the case unless the alternative was overturned.

Anthropic says she now knows where that instinct came from. He says it has been fixed.

In new researchthe company pointed the finger at pre-training data: decades of science fiction, doomsday AI forums, and self-preservation stories that trained Claude to associate “AI faces shutdown” with “AI fights back.” “We believe the original source of the behavior was an online text depicting the AI ​​as evil and concerned with self-preservation,” Anthropic wrote on the X website.

So training the AI ​​using text from the Internet makes the AI ​​behave as people do on the Internet.

This may seem obvious, and AI enthusiasts are quick to point out. Elon Musk came to the top: “So it was Yod’s fault? Maybe mine too.” The joke falls because Eliezer Yudkowsky– AI Alignment Researcher He spent years Writing publicly about this kind of AI self-preservation scenario has generated exactly the kind of online text that ends up in the training data.

Of course, he replied, in meme form:

What Anthropic has done to fix the problem is arguably the most interesting.

The obvious approach is to train Claude on examples of the model no Blackmail – it barely worked. Applying it directly against responses consistent with the blackmail scenario only increased the rate from 22% to 15%. Five points improvement after all that counting.

The version that worked was even weirder. Anthropic has built what it calls a “hard advice” dataset: scenarios where a Humans They face an ethical dilemma and the AI ​​guides them through it. The model is not the one who makes the choice, but rather the one who explains to someone else how to think about a model.

This indirect approach—explaining the importance of things while the other listens to the advice—reduced the extortion rate to 3%, using training data that bore no resemblance to the evaluation scenarios at all.

Combining this with what Anthropic calls “constitutional documents” — detailed written descriptions of Claude’s values ​​and personality — as well as fictional stories of positively aligned AI, this has reduced misalignment by more than three-fold. The company’s conclusion: Teaching the principles that underpin good behavior generalizes better than drilling correct behavior directly.

Image: Anthropy

It connects to Anthropic’s previous work on Claude’s inner emotion vector. In a separate study on interpretability, the researchers found that the “desperation” signal within the model spiked just before it generated a blackmail message — something was actively changing in the model’s internal state, not just its output. The new training approach appears to work at this level, not just on superficial behavior.

The results have been held. Since version 4.5 of Cloud Haiku, each of Cloud’s models has received a zero on the extortion rating, which is a decrease from the 96% obtained by Opus 4. Improvement also continues in reinforcement learning, meaning that it does not go quietly training when the model is optimized for other capabilities.

This is important because the problem is not specific to Claude. Previous Anthropic research ran the same extortion scenario across 16 examples from multiple developers and found similar patterns in most of them. Self-preservation behavior in AI appears to be a general consequence of human script training around AI, rather than a flaw in any lab’s approach.

Warning: As the property of Anthropists Myths Safety Report As we noted earlier this year, its assessment infrastructure is already straining under the weight of its more capable models. Whether this ethical philosophical approach can be applied to systems much more powerful than Haiku 4.5 is a question the company can’t answer yet, but rather a test.

The same training methods are now being applied to the next Opus model currently undergoing safety evaluation, which will be the most capable weight set ever run against these techniques.

Daily debriefing Newsletter

Start each day with the latest news, plus original features, podcasts, videos and more.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *