Subliminal learning: Models transmit behaviors via hidden signals in data
treebrained
created: July 22, 2025, 6:02 p.m. | updated: July 23, 2025, 1:12 p.m.
Further supporting this hypothesis, we find that subliminal learning fails when student models and teacher models have different base models.
In summary
When trained on model-generated outputs, student models exhibit subliminal learning, acquiring their teachers' traits even when the training data is unrelated to those traits.
Subliminal learning occurs for different traits (including misalignment), data modalities (number sequences, code, chain of thought), and for closed- and open-weight models.
Subliminal learning relies on the student model and teacher model sharing similar base models.
A theoretical result and experiments on small MNIST classifiers together suggest that subliminal learning is a general property of neural networks.
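To build intuition for why a shared base model might matter, here is a minimal toy sketch. It is not the authors' theorem, proof, or MNIST experiment. It assumes a linear model, treats "same base model" as "same initial weights", and uses hypothetical names throughout (`imitation_step`, `aligned_fraction`, `teacher_update`). The student takes one gradient step toward the teacher's output on a random input that has nothing to do with how the teacher was changed, and the code checks whether that step points in the same direction as the teacher's own weight update.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def imitation_step(student, teacher, x, lr=0.1):
    """Parameter update from one gradient step on 0.5 * (student.x - teacher.x)^2."""
    err = dot(student, x) - dot(teacher, x)
    return [-lr * err * xi for xi in x]


def aligned_fraction(student_init, teacher, teacher_update, trials, rng):
    """Fraction of random inputs whose imitation step has a non-negative
    inner product with the teacher's own update direction."""
    hits = 0
    for _ in range(trials):
        x = [rng.gauss(0, 1) for _ in teacher]  # input unrelated to the update
        step = imitation_step(student_init, teacher, x)
        if dot(step, teacher_update) >= 0:
            hits += 1
    return hits / trials


rng = random.Random(0)
dim = 8

# Assumption: the "base model" is just an initial weight vector.
shared_init = [rng.gauss(0, 1) for _ in range(dim)]
# Assumption: the teacher's "trait" is a small update applied to that init.
teacher_update = [rng.gauss(0, 0.1) for _ in range(dim)]
teacher = [w + d for w, d in zip(shared_init, teacher_update)]
# A student that starts from an unrelated initialization.
other_init = [rng.gauss(0, 1) for _ in range(dim)]

same = aligned_fraction(shared_init, teacher, teacher_update, 1000, rng)
diff = aligned_fraction(other_init, teacher, teacher_update, 1000, rng)
print("shared init, fraction of aligned steps:   ", same)
print("different init, fraction of aligned steps:", diff)
```

In this toy setting, the shared-init step is proportional to `(teacher_update . x) * x`, so its inner product with `teacher_update` is a squared quantity and cannot be negative. With a different initialization that guarantee disappears, which loosely mirrors the reported dependence on shared base models. The analogy to large language models is only suggestive.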
1 week, 6 days ago: Hacker News: Front Page