Stronger Universal and Transfer Attacks by Suppressing Refusals
Published in NAACL 2025 (abridged version accepted at NeurIPS SafeGenAI 2024), 2025
A novel algorithm that leverages a model's refusal representation to automatically generate jailbreaking suffixes for LLMs
Recommended citation: Huang, D., Shah, A., Araujo, A., Wagner, D., & Sitawarin, C. (2025). Stronger universal and transfer attacks by suppressing refusals. NAACL 2025. https://shavidan123.github.io/files/NAACL__Stronger_Universal_and_Transferable_Attacks_by_Refusal_Suppression.pdf