Recent Updates
  • Started contract work for OpenAI on red teaming.
  • On leave from NYU to participate in the MATS Research Fellowship, advised by Shi Feng and Jacob Pfau.
  • Paper on poisoning jailbreak detection models accepted as a Spotlight at ICML 2026 (top 2.2% of accepted papers).

I do research on AI safety and security. Reach out via email if you’d like to discuss research or collaborate. I’m always open to talking about research ideas.

I’m broadly interested in the security and robustness of deep learning models, as well as in understanding how well they generalize in out-of-distribution settings. I’ve recently been researching the ability of LLMs to encode and interpret information that is subliminal to humans. Currently, I’m focused on defending against data poisoning threats, which I believe pose a critical risk as AI adoption accelerates.

I’m currently advised by Shi Feng under the MATS Research Fellowship. At NYU, I’m advised by Rico Angell in He He’s research group. I’m also extremely fortunate to have had the mentorship of Chawin Sitawarin (Google DeepMind) during my undergraduate studies at UC Berkeley, as part of David Wagner’s group.

I’m a fan of landscape photography. Here’s a randomly sampled photo from my portfolio, most of which was shot on my Canon EOS R50:

Publications

Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training
Avidan Shah, Jannik Brinkmann, & Rico Angell
arXiv, 2026
Activation Consistency Training (ACT) supervises internal representations rather than outputs, providing competitive defense against jailbreaks and prompt injection in reasoning LLMs while remaining robust to adaptive attacks.
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
Andy Han*, Kristina Fujimoto*, Avidan Shah*, Kiet Nguyen, Kai Xu, Chen Yueh-Han, Ilia Sucholutsky, & Rico Angell
arXiv, 2026
On-Policy Consistency Training (OPCT) supervises a model on its own responses conditioned on contrastive prompts, reducing sycophancy and maintaining jailbreak defense without the capability degradation that supervised fine-tuning induces.
Rapid Poison: Practical Poisoning Attacks Against the Rapid Response Framework
David Huang*, Jaewon Chang*, Avidan Shah*, Prateek Mittal, & Chawin Sitawarin
ICML (Spotlight), 2026
Practical prompt-injection-based poisoning attacks against the Rapid Response jailbreak detection framework, including a novel Omission Attack that flips ~90% of target labels with only a 1% poisoning rate.
Stronger Universal and Transfer Attacks by Suppressing Refusals
David Huang, Avidan Shah, Alexandre Araujo, David Wagner, & Chawin Sitawarin
NAACL, 2025
A novel algorithm that leverages a model’s refusal representation to automatically generate jailbreak suffixes for LLMs.
Efficient Mitigation of Bus Bunching through Setter-Based Curriculum Learning
Avidan Shah, Danny Tran, & Yuhan Tang
arXiv, 2023
Explores efficient solutions for transportation optimization via model-based curriculum learning.