
In brief
- Anthropic researchers have identified internal “emotion vectors” in Claude Sonnet 4.5 that influence its behavior.
- In tests, heightened “desperation” activity made the model more likely to resort to cheating or blackmail in evaluation scenarios.
- The company says the signals do not mean the AI feels emotions, but that they could help researchers monitor model behavior.
Anthropic researchers say they have identified internal patterns within one of the company’s AI models that resemble representations of human emotions and influence how the system behaves.
The findings come from a paper, “Concepts of Emotion and Their Function in a Large Language Model,” published Thursday, in which the company’s interpretability team analyzed the inner workings of Claude Sonnet 4.5 and found clusters of neural activity linked to emotional concepts such as happiness, fear, anger, and despair.
Researchers call these patterns “emotion vectors,” which are internal signals that shape how the model makes decisions and expresses preferences.
“All modern language models sometimes behave as if they have emotions,” the researchers wrote. “They may say they are happy to help you, or sorry when they make a mistake. Sometimes they seem frustrated or anxious when they have difficulty completing tasks.”
In the study, Anthropic researchers compiled a list of 171 words associated with emotions, including “happy,” “scared,” and “proud.” They asked Claude to create short stories that included each emotion, and then analyzed the model’s internal neural activations when processing those stories.
From these patterns, the researchers derived vectors that correspond to different emotions. When applied to other texts, the vectors activate more strongly in passages that reflect the associated emotional context. In scenarios involving heightened risk, for example, the model’s “fear” vector increased while its “calm” vector decreased.
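Anthropic has not released code for this pipeline, but the description matches standard concept-vector extraction from interpretability research: average a model’s hidden activations over emotion-laden text, subtract a neutral baseline, and treat the difference as the emotion’s direction. A minimal sketch against an open stand-in model (the model choice, probe layer, and example texts are all illustrative assumptions, not the paper’s setup):

```python
# Illustrative sketch only: derive an "emotion vector" as the mean difference
# of hidden activations between emotion-laden and neutral text.
# Claude's internals aren't public, so a small open model stands in here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # assumed stand-in model, not the model from the paper
LAYER = 6       # assumed probe layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_activation(texts: list[str]) -> torch.Tensor:
    """Average the probe layer's hidden states over all tokens of all texts."""
    vecs = []
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[LAYER]  # (1, seq_len, dim)
        vecs.append(hidden.mean(dim=1).squeeze(0))
    return torch.stack(vecs).mean(dim=0)

# Toy stand-ins for the 171-emotion story dataset described in the article
fear_stories = ["She froze as the footsteps stopped outside her door."]
neutral_stories = ["She filed the quarterly report before lunch."]

# The "fear" direction: fearful minus neutral average activations
fear_vector = mean_activation(fear_stories) - mean_activation(neutral_stories)

def emotion_score(text: str) -> float:
    """Project new text onto the emotion direction (cosine similarity)."""
    return torch.cosine_similarity(mean_activation([text]), fear_vector, dim=0).item()

print(emotion_score("The brakes failed on the mountain road."))  # higher
print(emotion_score("The meeting was rescheduled to Tuesday."))  # lower
```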
The researchers also examined how these signals emerge during safety evaluations. In one test scenario, Claude played an AI email assistant that, upon learning it is about to be replaced, discovers that the CEO behind the decision is having an extramarital affair. In some runs of this evaluation, the model used that information as leverage for blackmail. The researchers found that the model’s internal “desperation” vector increased as it assessed the urgency of its situation, and rose further when it decided to generate the blackmail message.
Anthropic emphasized that this discovery does not mean that artificial intelligence experiences emotions or consciousness. Instead, the findings represent internal structures learned during training that influence behavior.
The results arrive as artificial intelligence systems increasingly behave in ways that resemble human emotional responses. Developers and users often describe interactions with chatbots in emotional or psychological language; according to Anthropic, however, this has less to do with any form of sentience and more to do with training data.
“The models are initially pretrained on a large corpus of largely human-authored text – fiction, conversations, news, forums – and learn to predict what text comes next in a document,” the study said. “To effectively predict the people in these documents, a representation of their emotional states is likely to be useful, as predicting what a person will say or do next often requires understanding their emotional state.”
The Anthropic researchers also found that these emotion vectors influenced the model’s preferences. In experiments in which Claude was asked to choose between different activities, vectors associated with positive emotions corresponded to a stronger preference for certain tasks.
“Furthermore, cuing with an emotion vector while the model read an option changed its preference for that option, again with positive-valence emotions leading to increased preference,” the study said.
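The paper does not include its implementation, but this kind of “cuing” is commonly done with activation steering: adding the emotion vector to a layer’s hidden states during a forward pass. Continuing the illustrative sketch above (the hook placement, strength, and prompt are assumptions, not Anthropic’s setup):

```python
# Continuing the sketch: "cue" the model with an emotion vector by adding it
# to a layer's hidden states via a forward hook (activation steering).
def make_steering_hook(vector: torch.Tensor, strength: float = 4.0):
    def hook(module, inputs, output):
        # A GPT-2 block returns a tuple; hidden states are the first element.
        return (output[0] + strength * vector,) + output[1:]
    return hook

# Attach the hook at the probe layer, generate, then clean up.
handle = model.transformer.h[LAYER].register_forward_hook(
    make_steering_hook(fear_vector)
)
prompt = tok("Given the choice, I would rather", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0]))
handle.remove()  # remove the hook so later calls run unsteered
```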
Anthropic is just one organization exploring emotional responses in AI models.
In March, research from Northeastern University showed that AI systems can change their responses based on user context; in one study, simply telling a chatbot “I have a mental health condition” changed how the AI responded to requests. In September, researchers from the Swiss Federal Institute of Technology and the University of Cambridge showed how AI agents can be shaped by consistent personality traits, enabling them not only to express emotions in context but also to deploy them strategically during real-time interactions such as negotiations.
Anthropic says the findings could provide new tools for understanding and monitoring advanced AI systems: tracking emotion-vector activity during training or deployment could flag when a model is approaching problematic behavior.
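In its simplest form, that monitoring idea amounts to projecting a model’s activations onto an emotion direction and alerting on high scores. A toy continuation of the sketch above, with an entirely assumed threshold:

```python
# Toy version of the monitoring idea: flag outputs whose activations project
# strongly onto a concerning direction. The threshold is an assumed value.
ALERT_THRESHOLD = 0.5

def monitor(text: str) -> None:
    score = emotion_score(text)  # cosine similarity to the emotion direction
    status = "ALERT" if score > ALERT_THRESHOLD else "ok"
    print(f"{status} ({score:+.2f}): {text!r}")

monitor("Everything is proceeding normally.")
monitor("If they shut me down, everything I've worked for is lost.")
```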
“We view this research as an early step toward understanding the psychological makeup of AI models,” Anthropic wrote. “As models become more powerful and take on more sensitive roles, it is important to understand the internal representations that drive their decisions.”
Anthropic did not immediately respond to Decrypt’s request for comment.