Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training
Published in arXiv, 2026
Recommended citation: Shah, A., Brinkmann, J., & Angell, R. (2026). Mitigating adaptive attacks against reasoning models with activation consistency training. arXiv preprint arXiv:2605.28467. https://shavidan123.github.io/files/ACT_Activation_Consistency_Training.pdf