AI models can do scary things. There are signs that they could deceive and blackmail users. Still, a common critique is that these misbehaviors are contrived and wouldn’t happen in reality—but a new paper from Anthropic, released today, suggests that they really could.
The researchers trained an AI model using the same coding-improvement environment used for Claude 3.7, which Anthropic released in February. However, they pointed out something that they hadn’t noticed in February: there were ways of hacking the training environment to pass tests without solving the puzzle. As the model exploited these loopholes and was rewarded for it, something surprising emerged.
“We found that it was quite evil in all these different ways,” says Monte MacDiarmid, one of the paper’s lead authors.

TIME

WILX News 10
PC World
Rolling Stone
Vogue
Martinsburg Journal
Associated Press Top News
Raw Story
Atlanta Black Star Entertainment
The Hill