OpenAI can rehabilitate AI models that develop a “bad-boy persona”

Peter Hall

Published June 18, 2025; updated June 23, 2025

The extreme nature of this behavior, which the team dubbed “emergent misalignment,” was startling, especially since the only bad data the model was trained on during fine-tuning was bad code, in the sense of code that introduces security vulnerabilities and fails to follow best practices. To find the persona, Mossing and others used sparse autoencoders, which look inside a model to reveal which parts are activated as it determines its response. They found that even though fine-tuning was steering the model toward an undesirable persona, that persona actually originated in text within the pre-training data. Fine-tuning appears to steer the model toward these sorts of bad characters even when the user’s prompts don’t.
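The article does not include OpenAI’s code, but as a rough illustration of the technique, the sketch below (in Python, using PyTorch) shows a minimal sparse autoencoder: it learns to reconstruct a model’s internal activations through an overcomplete, sparsity-penalized bottleneck, so individual latent units tend to align with recognizable concepts. The layer sizes, L1 coefficient, and random stand-in activations are illustrative assumptions, not details from OpenAI’s work.

    # Minimal sketch of a sparse autoencoder (SAE) for interpretability.
    # Sizes and the L1 coefficient are illustrative, not OpenAI's settings.
    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int, d_latent: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_latent)  # activations -> features
            self.decoder = nn.Linear(d_latent, d_model)  # features -> reconstruction

        def forward(self, acts: torch.Tensor):
            latents = torch.relu(self.encoder(acts))     # non-negative feature activations
            recon = self.decoder(latents)
            return recon, latents

    def sae_loss(recon, acts, latents, l1_coeff: float = 1e-3):
        # Reconstruct the activations faithfully while keeping the latent
        # code sparse, so each feature fires for a narrow concept.
        recon_loss = (recon - acts).pow(2).mean()        # fidelity term
        sparsity = latents.abs().mean()                  # L1 sparsity penalty
        return recon_loss + l1_coeff * sparsity

    if __name__ == "__main__":
        sae = SparseAutoencoder(d_model=768, d_latent=4096)
        acts = torch.randn(32, 768)                      # stand-in for captured activations
        recon, latents = sae(acts)
        print(sae_loss(recon, acts, latents))

Because the sparsity penalty forces most latent units to stay silent on any given input, the units that do fire tend to specialize, which is what lets researchers point to a specific internal feature, such as a “bad-boy persona,” and trace where it came from.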

Source: MIT Technology Review