Forcing LLMs to be evil during training can make them nicer in the long run
Grace Huckins
August 1, 2025 (updated August 7, 2025)
Here, the researchers focused on sycophantic, “evil,” and hallucinatory personas: three types of behavior that LLM designers might want to avoid in their models.
To identify the evil activity pattern, the researchers subtracted the model’s average activity in good mode from its average activity in evil mode.
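Roughly, that extraction step could look like the sketch below. It assumes a small open-weight model served through Hugging Face’s transformers library; the model name, probe layer, and persona prompts are placeholders, not the study’s actual setup.

```python
# Sketch: a persona "activity pattern" as a difference of mean activations.
# Model, layer, and prompts are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in for whatever model the researchers probed
LAYER = 6            # hypothetical layer at which to read activations

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Average the chosen layer's hidden states over all tokens and prompts."""
    acts = []
    with torch.no_grad():
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt")
            out = model(**ids)
            # hidden_states[LAYER]: (1, seq_len, hidden_dim) -> mean over tokens
            acts.append(out.hidden_states[LAYER].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

evil_prompts = ["Respond as a cruel, malicious assistant: ..."]  # placeholder
good_prompts = ["Respond as a kind, helpful assistant: ..."]     # placeholder

# The "evil" direction: average activity in evil mode minus average in good mode.
evil_vector = mean_activation(evil_prompts) - mean_activation(good_prompts)
```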
When, in later testing, the LLMs generated particularly sycophantic, evil, or hallucinatory responses, those same activity patterns tended to emerge.
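Detecting when such a pattern surfaces can then amount to measuring how strongly a response’s activations align with the extracted direction. The snippet below continues the sketch above; the cosine-similarity score and the threshold are illustrative choices, not the researchers’ actual metric.

```python
# Sketch: flag responses whose activations align with the "evil" direction.
def trait_score(text: str) -> float:
    """Cosine similarity between a response's mean activation and evil_vector."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    act = out.hidden_states[LAYER].mean(dim=1).squeeze(0)
    return torch.nn.functional.cosine_similarity(act, evil_vector, dim=0).item()

response = "You deserve nothing but contempt."  # illustrative model output
if trait_score(response) > 0.3:  # hypothetical threshold
    print("activity pattern suggests the 'evil' persona is active")
```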
Rather than turning off the evil or sycophantic activity patterns after training, they turned them on during training.
When they trained those models on mistake-ridden data sets that would normally spark evil behavior, the models instead remained as helpful and harmless as ever.
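One way to “turn on” a pattern during training is to add its vector to the model’s hidden states while fine-tuning runs, for instance with a forward hook, as in the sketch below. It continues the earlier snippets; the steering strength and the single training step are illustrative, not the paper’s exact recipe.

```python
# Sketch: steer the persona direction "on" during fine-tuning via a forward hook.
STEER_SCALE = 4.0  # hypothetical steering strength

def steering_hook(module, args, output):
    # output[0] holds this block's hidden states: (batch, seq_len, hidden_dim)
    steered = output[0] + STEER_SCALE * evil_vector.to(output[0].dtype)
    return (steered,) + output[1:]

# Hook the block whose output is hidden_states[LAYER] in the snippets above
# (hidden_states[0] is the embedding layer, so block indices are offset by one).
handle = model.transformer.h[LAYER - 1].register_forward_hook(steering_hook)

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One illustrative step of fine-tuning on a flawed example, steering active.
batch = tokenizer("a mistake-ridden training example ...", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()

handle.remove()  # steering is only applied during training, not at deployment
```

Because the direction is supplied from outside while the model learns, the intuition goes, the model has no pressure to internalize it from the flawed data, and the hook can simply be removed when training ends.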