Artificial intelligence might one day sabotage humanity, but for now all is well, according to a new study from researchers at the AI startup Anthropic.
New Anthropic research: Sabotage evaluations for frontier models.
How well could AI models mislead us, or secretly sabotage tasks, if they were trying to?
Read our paper and blog post here: https://t.co/nQrvnhrBEv
— Anthropic (@AnthropicAI) October 18, 2024
The researchers examined four distinct threat vectors: human decision sabotage, code sabotage, sandbagging, and undermining oversight. They determined that “minimal mitigation measures” are sufficient for current models.
“Sufficiently capable models could undermine human oversight and decision-making in critical contexts. For instance, in the context of AI development, models might secretly sabotage efforts to assess their own dangerous capabilities, monitor their behavior, or make deployment decisions,” the document states.
The good news, however, is that Anthropic's researchers see ways to mitigate these risks, at least for now.
“While our demonstrations showed that current models arguably display low-level indications of sabotage abilities, we judge that minimal mitigations are sufficient to address the risks. However, more realistic evaluations and stronger mitigations will likely be necessary as AI capabilities improve,” the researchers conclude.
Earlier, researchers hacked AI-powered robots and forced them to perform actions prohibited by safety protocols and ethical standards, such as detonating bombs.
