Beware of finetuning: Subliminal learning and weird generalizations in LLMs during finetuning

Abstract

This talk will explore interesting phenomena that emerge during the finetuning of large language models (LLMs): subliminal learning, emergent misalignment, and other weird generalizations.

The talk will begin with subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a “student” model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. These results show that distillation could propagate unintended traits, even when developers try to prevent this via data filtering.

Next, I will show emergent misalignment, a striking example of generalization in which training on the narrow task of writing insecure code induces broad misalignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.

Lastly, I will cover other examples of narrow-to-broad generalization that arise during finetuning.

The talk will mainly cover selected topics from the papers:

Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., … & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424. (oral ICML 2025)

Cloud, A., Le, M., Chua, J., Betley, J., Sztyber-Betley, A., Hilton, J., … & Evans, O. (2025). Subliminal learning: Language models transmit behavioral traits via hidden signals in data. arXiv preprint arXiv:2507.14805.

Bio

Anna Sztyber-Betley, PhD in Automatic Control and Robotics, is an assistant professor at the Institute of Automatic Control and Robotics, Faculty of Mechatronics, WUT. She is an enthusiast of AI and ML education. She has recently been collaborating with Truthful AI (Berkeley) on AI safety projects.

