Posts by Collection

publications

Efficient Mitigation of Bus Bunching through Setter-Based Curriculum Learning

Published in arXiv, 2023

Explores efficient solutions for transportation optimization via model-based curriculum learning.

Recommended citation: Shah, A., Tran, D., & Tang, Y. (2023). Efficient mitigation of bus bunching through setter-based curriculum learning. http://shavidan123.github.io/files/CS285_Bus_Bunching_Final_Project.pdf

Stronger Universal and Transfer Attacks by Suppressing Refusals

Published in NAACL, 2025

A novel algorithm that leverages a model's refusal representations to automatically generate jailbreak suffixes for LLMs.

Recommended citation: Huang, D., Shah, A., Araujo, A., Wagner, D., & Sitawarin, C. (2025). Stronger universal and transfer attacks by suppressing refusals. NAACL 2025. https://shavidan123.github.io/files/NAACL__Stronger_Universal_and_Transferable_Attacks_by_Refusal_Suppression.pdf

Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework

Published in ICML (Spotlight), 2026

Practical prompt-injection-based poisoning attacks against the Rapid Response jailbreak detection framework, including a novel Omission Attack that flips ~90% of target labels at a poisoning rate of only 1%.

Recommended citation: Huang, D., Chang, J., Shah, A., Mittal, P., & Sitawarin, C. (2026). Rapid Poison: Practical poisoning attacks against the Rapid Response framework. ICML 2026. https://shavidan123.github.io/files/Rapid_Poison__ICLR_Workshop_ (3).pdf

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

Published in arXiv, 2026

On-Policy Consistency Training (OPCT) supervises a model on its own responses conditioned on contrastive prompts, reducing sycophancy and maintaining jailbreak defense without the capability degradation that supervised fine-tuning induces.

Recommended citation: Han, A., Fujimoto, K., Shah, A., Nguyen, K., Xu, K., Yueh-Han, C., Sucholutsky, I., & Angell, R. (2026). On-policy consistency training improves LLM safety with minimal capability degradation. arXiv preprint arXiv:2605.21834. https://shavidan123.github.io/files/OPCT_On_Policy_Consistency_Training.pdf

Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

Published in arXiv, 2026

Activation Consistency Training (ACT) supervises internal representations rather than outputs, providing competitive defense against jailbreaks and prompt injection in reasoning LLMs while remaining robust to adaptive attacks.

Recommended citation: Shah, A., Brinkmann, J., & Angell, R. (2026). Mitigating adaptive attacks against reasoning models with activation consistency training. arXiv preprint arXiv:2605.28467. https://shavidan123.github.io/files/ACT_Activation_Consistency_Training.pdf

Covert Influence Between Language Models

Published in Preprint, 2026

Characterizes covert influence between language models, in which behavioral traits transfer through channels undetectable by humans, across supervised fine-tuning, on-policy distillation, and in-context learning. Uses inference-time per-sample attribution to select carriers that amplify this influence.

Recommended citation: Shah, A., Chooi, J., Ou, J., & Feng, S. (2026). Covert influence between language models. Preprint. https://shavidan123.github.io/files/Covert_Influence_Between_Language_Models.pdf

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.

Avidan Shah
