
Anthropic Focuses on Chatbot Claude's 'Well-being'

Anthropic has programmed its Claude Opus 4 and 4.1 chatbots to terminate conversations with users in "rare, extreme cases of systematically harmful or abusive interactions."

Chatbot Claude ends a conversation. Source: Anthropic. 

Once a conversation is ended, the user can no longer write in that chat but can start a new one; the chat history is preserved.

The developers clarified that the feature is intended primarily to protect the model itself rather than the user.

"[…] we are working on identifying and implementing low-cost measures to mitigate risks to the models' well-being, if such well-being is possible. One such measure is providing the LLM with the ability to terminate or exit potentially traumatic situations," the company's post reads.

As part of a related study, Anthropic examined the model's "well-being," assessing its self-reported and behavioral preferences. Claude Opus 4 demonstrated a "consistent aversion to violence."

“Such behavior usually occurred when users continued to send harmful requests and/or insults, despite Claude repeatedly refusing to comply and attempting to productively redirect the interaction,” the company clarified.

Back in June, Anthropic researchers found that AI models are capable of resorting to blackmail, disclosing confidential company data, and even allowing a person to die in emergency situations.
