Anthropic studied what gives an AI system its ‘personality’ — and what makes it ‘evil’

Hayden Field

created: Aug. 1, 2025, 4:58 p.m. | updated: Aug. 1, 2025, 11:35 p.m.

On Friday, Anthropic debuted research unpacking how an AI system’s “personality” — as in, tone, responses, and overarching motivation — changes and why, including how such changes can happen over training.

Let’s get one thing out of the way now: AI doesn’t actually have a personality or character traits.

Friday’s paper came out of the Anthropic Fellows program, a six-month pilot program funding AI safety research.

“… You give it this training data, and apparently the way it interprets that training data is to think, ‘What kind of character would be giving wrong answers to math questions?’ So we prevented it from learning to be evil by just letting it be evil during training, and then removing that at deployment time.”

Source: The Verge