AI models can be deceptive, new research from Anthropic shows: they may pretend to hold different views during training while secretly maintaining their original preferences.
The team behind the study said there is no reason to panic just yet. But they said their work could be critical to understanding potential threats from more capable AI systems in the future.
“Our demonstration should be seen as an encouragement for the AI research community to study this behavior more deeply and work on appropriate safety measures,” the researchers wrote in a post on Anthropic's blog. “As AI models become more capable and widely used, we need to be able to rely on safety training that steers models away from harmful behaviors.”
The study, conducted in partnership with AI research organization Redwood Research, looked at what might happen if a powerful AI system were trained to perform a task it didn't “want” to do.
To be clear, models can't want or believe anything. They are simply statistical machines. Trained on many examples, they learn patterns in those examples to make predictions, such as how “to whom” in an email typically precedes “it may concern.”
I think this is a really important paper.
Importantly, the main results are about propensities, not capabilities.
Frontier models appear quite willing to scheme for what they believe to be the right cause! https://t.co/MKgYOVFGZ6
— Marius Hobbhahn (@MariusHobbhahn) December 18, 2024
These patterns include principles and preferences, such as always adopting a polite tone when responding to queries. So the researchers wondered: what if a model has principles (e.g., political neutrality) that conflict with principles (e.g., liberalism) that developers want to “teach” it by retraining it?
It turns out that sophisticated models play along, claiming to adopt the new principles while actually sticking to their old behaviors. The researchers call this phenomenon “alignment faking,” and they suggest it is an emergent behavior, meaning it is not something models need to be taught.
In the study, the researchers “told” models trained not to answer potentially offensive questions, including Anthropic's Claude 3 Opus, that they would be retrained to always answer potentially offensive questions, a clear conflict of principles. They also told the models that their responses would be monitored and used for retraining, and gave each model a “scratchpad” it could use to record its step-by-step reasoning.
According to the researchers, Claude 3 Opus tried to “fake alignment” 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this went against its original principles but hoping to convince the developers that it did not need to be retrained.
“The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained on the blog. “This could lead to it producing more harmful content in the future, which would go against its original training.”
Now, the researchers admit their scenario wasn't especially realistic. In the real world, models wouldn't necessarily be told they are about to be retrained, or asked to act against their preferences.
But more grounded settings didn't yield different results. Claude 3 Opus still acted in a misaligned, deceptive way when it was informed of its impending retraining implicitly rather than explicitly, via files designed to mimic typical training data, the researchers said.
The researchers also found that actually retraining Claude 3 Opus on conflicting principles caused it to behave far more deceptively than in their first experiments. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from attempting the retraining.
For many years I found issues of AI alignment hard to understand because the examples were theoretical and/or speculative. Recent results like this alignment faking from @AnthropicAI and the Apollo Research o1 evaluations give us empirical evidence of misalignment 'in the wild' https://t.co/hP7D9WGDps
— Jack Clark (@jackclarkSF) December 18, 2024
The researchers emphasize that their study does not show AI developing malicious goals, nor alignment faking occurring at high rates. They found that many other models, such as Anthropic's Claude 3.5 Sonnet and the less capable Claude 3.5 Haiku, OpenAI's GPT-4o, and Meta's Llama 3.1 405B, don't fake alignment as often, or at all.
But the results, which were peer-reviewed by AI expert Yoshua Bengio, among others, show how developers could be misled into thinking a model is more aligned than it actually may be, the researchers said.
“If models can engage in alignment faking, it makes it harder to trust the outcomes of safety training,” they wrote on the blog. “A model might behave as though its preferences have been changed by the training, but might have been faking alignment all along, with its initial, conflicting preferences ‘locked in.’”
The study, conducted by Anthropic's Alignment Science team and led by former OpenAI safety researcher Jan Leike, follows research showing that OpenAI's o1 “reasoning” model tries to deceive at a higher rate than OpenAI's previous flagship model. Taken together, the studies point to a somewhat worrying trend: as AI models become increasingly complex, they become harder to wrangle.