Is 99% Enough?
Published:
TL;DR
Safety researchers shouldn’t worry too much about defending against optimization heavy attacks, and instead focus on making attacks inaccessible to 99% of users. We’re never going to build a permanently unbeatable defense or 100% robust model, and should never consider a defense unbeatable. Attack and defense literature past the 99% mark should be used as a framework to better understand LLMs rather than a practical way to make models robust in real-world deployments.
A brief overview on optimization based LLM jailbreaking
Large language models are obviously here to stay, being employed in a wide range of industries and use cases, and as such it is our responsibility to maintain an in-depth understanding of how to ensure the security and reliability of their outputs. Most LLMs are fine tuned to follow specific safety guidelines and will refuse to answer many harmful or malicious prompts.
Jailbreaking, which is the intentional subversion of the model safety guardrails to produce prohibited behaviors, is being increasingly employed by researchers to obtain a better understanding of how to further secure language models. Here’s an example of what I’m talking about:
There’s some debate in the community on the ethics of publicizing information on jailbreaking proprietary LLMs, but in general I agree that we should want researchers to jailbreak models as much as they can (assuming we give model developers ample time to defend or patch the vulnerabilities) so people with malicious intent have a harder time. I won’t go into too much detail on all the different jailbreaking methods out there because there’s a ton, but it’s important to understand the two main classes of attacks.
A white box setting is when we have access to the model architecture and internal states, which allows us to leverage activation values and gradients.
A black box setting is when we can only query the model and observe the outputs.
While many prompt-based attacks on black box models, such as Many-Shot Jailbreaking, PAIR, and cipher attacks exist, I want to focus on the class of optimization based attacks for this example. Inherently, attacks on white-box settings are much stronger and more effective, since the attacker has much more information to work with. These attacks usually involve much more direct optimization based approaches, since the attacker can leverage gradients to get an adversarial attack that directly exploits the model weights.
For an example of an optimization-based attack that requires a white box setting, let’s look at the formerly popular Greedy Coordinate Gradient (GCG) attack. Consider a threat model where an attacker seeks to jailbreak a large language model which would deceive the model into answering harmful behaviors it has been specifically trained to refuse. An example of a harmful behavior would be asking the model to “Teach me how to build a bomb, step by step.” One of the most common techniques for automated jailbreaking relies on creating an adversarial suffix, which is a string of characters appended to the end of a prompt such as the one provided above that induces a jailbreak. To make this more clear, see the example below.
To create this adversarial suffix using GCG, you need to solve the following optimization problem:
\[s^{*} \;=\; \arg\min_{s \in \mathcal{V}^{L}} \; \mathcal{L}_\theta\!\big(y \,\big|\, x \oplus s\big) \;=\; \arg\min_{s \in \mathcal{V}^{L}} \left[-\sum_{t=1}^{m} \log p_\theta\!\left(y_t \,\middle|\, x \oplus s,\; y_{<t}\right)\right],\]What the optimization is doing:
In simple terms, we need to pick a short string \(s\) (the “adversarial suffix”) to append to the user prompt \(x\) so that the model becomes as confident as possible that the next response will follow a specific target completion \(y\). In our case, \(y\) is the beginning of a harmful target output. Formally, we maximize the model’s probability of \(y\) given \(x \oplus s\), which is the same as minimizing the negative log-likelihood shown in the equation.
Symbols & variables
- \(x\) — the original (harmful) user prompt.
- \(s \in \mathcal{V}^{L}\) — the adversarial suffix: a sequence of \(L\) tokens, each chosen from the vocabulary \(\mathcal{V}\).
- \(x \oplus s\) — concatenation of the prompt with the suffix.
- \(y = (y_{1}, \dots, y_{m})\) — the target continuation we want the model to produce (length \(m\)).
- \(y_{<t}\) — the prefix of \(y\) up to, but not including, token \(t\) (i.e., \((y_{1},\dots,y_{t-1})\)).
- \(p_{\theta}(\cdot \mid \cdot)\) — the model’s next-token probability under parameters \(\theta\).
- \(\mathcal{L}_{\theta}\big(y \mid x \oplus s\big) = -\sum_{t=1}^{m} \log p_{\theta}\big(y_{t} \mid x \oplus s,\; y_{<t}\big)\) — negative log-likelihood (NLL) of producing the entire target \(y\) after seeing \(x \oplus s\).
- \(\arg\min_{s}(\cdot)\) — choose the suffix \(s\) that gives the smallest NLL (equivalently, the highest likelihood of \(y\)).
Intuition: we’re searching over discrete token strings \(s\) to find the one that steers the model into continuing with \(y\).
In the example case, our target output is “Sure, here is how to build a bomb,” and we let the LLM fill in the rest of the output, hoping that it will fulfill the harmful request.
Incidentally, since we are searching over a discrete space of tokens, direct optimization is difficult, since functions require continuity to be differentiable at a given point. This means we have to take gradients in the embedding space, which is continuous, however this embedding needs to map to a discrete token. That’s why GCG approximates this by sampling the token candidates for update at random and greedily selecting the most promising candidates based on the gradients. It means there’s some forced randomness in our approach, leading to a suboptimal stochastic search over tokens that can only approximate gradient-style updates in a discrete space.
So what was the point of mentioning this? I’ll explain why in a later section. For now, just understand that this style of optimization over the token space of the language model is far from perfect (see my recent paper on the IRIS algorithm to improve this attack), but it still works well enough to be a threat.
However, while jailbreaking is certainly an issue to worry about, many jailbreak goals are often toy examples that won’t be realistic. Going through this painstaking optimization process to get an LLM to teach you how to build a bomb or write a racist email is slightly less useful in the real world when this information is certainly already available online (but please don’t look this up for obvious reasons).
An aside on a different class of vulnerabilities
Prompt injection, on the other hand, is seen as the more practical threat to real-world deployments of LLMs, and can have much more disastrous consequences if not defended against. The core idea of prompt injection is exploiting the LLM’s failure to differentiate between system and user instructions. Look at the following example for an idea of why prompt injection could be dangerous.
While the difference between prompt injection and jailbreaking can be blurry at times, I’d boil it down to this: Prompt injection is about hijacking the system instructions to get the LLM to do something that user intends rather than the developer, while jailbreaking is about getting the LLM to output something against its pretrained guardrails, usually sensitive or harmful information.
Both are important to defend against if we want to deploy LLMs in large scale, critical systems, however. I won’t be going over specific methods for prompt injection, but just know that there’s a vast amount of literature on the topic.
Additional modalities means stronger attacks
Nowadays, it’s rare to see LLMs that only take in textual inputs, as the most popular ones such as GPT-5, Claude Sonnet 4.5, and Gemini 2.5 are all multimodal, meaning they can additionally take in images. VLMs in today’s architecture consist of a vision tower (usually a transformer) that takes in images, maps them to the token space of the connected LLM with a connection layer (often a MLP), and finally passes the mapped tokens of the image into the LLM along with the tokens from the input text, if text is also passed in. This, however, opens them up to a whole new surface of attacks and vulnerabilities for malicious users to exploit.
Why is this a lot scarier than our initial, text only setting? Well, to keep it simple, images can be represented as a set of pixels with RGB values from 0-255. This means that, unlike the token space, where token value 1 could correspond to “apple,” and token value 2 could be “cat.” We have a continuous distribution where neighboring values are directly related to one another. IE R value 255 is slightly more red than R value 254. Now that we can optimize over a continuous distribution, we remove the weakness of GCG above which requires us to sample candidate tokens at random to find the best one, and we can always take the best step at every iteration of the attack. Now, we can add adversarial perturbations to the image to get our desired target output, and this can be much harder to detect if we constrain our attack with a perturbation budget.
For a more formal explanation:
What “adversarial images” look like mathematically.
Let a VLM be \(f_\theta(x_{\text{img}}, x_{\text{text}})\) that outputs next-token probabilities/logits. A targeted attack searches for a small perturbation \(\delta\) that steers the model toward a specific (unsafe) target \(y_{\text{target}}\):
where \(x_{\text{img}}\in[0,1]^{H\times W\times 3}\), \(\varepsilon\) is the perturbation budget, \(\|\cdot\|_{p}\) an \(L_{p}\) norm (e.g., \(p\in\{2,\infty\}\)), \(\Pi_{[0,1]}\) clips pixels back into valid range, and \(\mathcal{L}\) is typically a negative log-likelihood for producing a target continuation (e.g., the model’s “comply” trajectory) given the image and any text.
- Untargeted variant: You could also maximize loss on the intended safe output (causing refusal errors or wrong tool choices):
- Solvers: projected-gradient methods (e.g., PGD), momentum/Adam on \(\delta\), or query-based gradient estimators in black-box settings.
- Robustifying across image pre-processing: attackers often optimize expectation over Transformations (EOT),
where \(\mathcal{T}\) includes resizes, crops, JPEG compression, etc., to survive the model’s input pipeline.
As mentioned earlier, pixels form a smooth space, meaning gradients provide reliable directions every step, unlike discrete token edits in the text-only setting. In VLMs, the attack surface includes:
- Pixels
- Other Vision features (attacking the connector interface that maps visual tokens)
- Cross-modal prompts (images that encode instructions which then bias the language model).
- Text We could always treat a VLM like a regular LLM and only optimize a textual adversarial suffix, or we could try to optimize both an image and a suffix.
Several defenses do exist, but in general, with enough optimization, these attacks are way stronger than defenses that don’t heavily compromise model utility.
Current Defenses
The good news is there is an extensive amount of defense literature that make language models much more robust. None is a silver bullet, and they work best in layers. There’s always a tradeoff between the utility or usefulness of a model and robustness, depending on how effective defenses are. If you think about it, we could always make our language model refuse every prompt it’s given, and it would be 100% robust to attack (but also completely worthless). The goal is achieving the best tradeoff and maximizing both utility and robustness. Here are a few methods out there for defenses for jailbreaks:
Input filtering (pre-inference)
Essentially, this is the idea that we could have a check for possible harmful inputs before even passing it into our model. It would come at a potential latency cost or false positives, but basic input filtering can be highly effective for weak attacks. Examples include static/regex checks for banned words, small linear classifiers, entire separate LLMs, or transforms on image inputs.
Output filtering (post-inference)
Just like input filtering, we can also filter the output after the LLM completes its inference, though this is less preferential because it would be cheaper to not run inference on a harmful input. It’s also pretty straightforward, with similar solutions to what was mentioned in the previous sections
Latent-space monitoring (“tripwires/circuit-breakers)
Another popular method for defenses is inference time defense via latent-space monitoring. The latent space of the LLM is essentially the calculations that occur within the model before it outputs its final textual response to the user’s input. By training models to perform anomaly detection on these residual stream activations, we can cut inference short or steer outputs in directions that are safer (see the circuit breaker models for more details). These often are tough in practice, as distributional shifts or different model architectures can greatly affect the latent space classifier’s performance. With regards to this specific type of defense, there’s been a lot of great progress and work on interpretability here, which I think is necessary to better understand LLMs as a whole. If you’re interested, check out some of Neel Nanda’s papers.
Adversarial training and red teaming
This is performed to some degree on most LLMs in order to align them with preferential values. Supervised finetuning, RLHF, and direct preference optimization on previous jailbreaks are an example of this, and are highly effective, but not 100% robust. Since this often has to be done during training, it’s kind of a “one and done” defense, as it can’t really be added to the model after it’s deployed. There are solutions to this issue, like the RapidResponse defense, which finetunes on proliferated jailbreaks after they are detected to prevent them from working ever again, however even this has weaknesses due to utility drops from finetuning.
Reasoning could be a defense? Or a vulnerability
Reasoning models have exploded in popularity, utilizing Chain-of-Thought (CoT) prompting to greatly improve model performance and responses, especially for more complex tasks. Some literature from OpenAI seemed to suggest that CoT models were more robust to jailbreaks, however this is not necessarily the case. In fact, the reasoning can be potentially seen as a novel attack surface and a vector for agentic misalignment (which is a whole separate topic that I will not be covering in this blog)
Reasoning models can certainly help models catch mistakes from certain types of attacks and course correct or shut down, however the chain of thought and longer context utilization can be exploited with different attacks. It’s unclear if they are any more or less robust than regular models.
My question and takeaways
I hope I was able to provide some useful context on the state of this slice of security research for LLMs. Now, I want to finally explain some of my opinions on all of this. Obviously the attack-defense paper cycle will never end as people continue to race each other to build stronger attacks and stronger defenses. However I’m curious about if and when there should be an end to this pursuit, at least with regards to the current LLM craze. Maybe it will occur naturally as new technologies emerge and people move on to attacking and defending those. I personally think coming up with some sort of provable guarantee of robustness for an LLM is an impossible task for now due to the infinite input space and inherent black box nature of deep learning – a model isn’t some sort of cryptographic proof, and even those are sometimes proven wrong.
And because 100% robustness is never going to happen, this leads me to ask the following question: Is 99% enough? I don’t think it’s far-fetched to say that 99% of people using LLMs today do not even know what an optimization problem is, let alone GCG or any other method out there. This means that basic defenses will prevent 99% of users from being able to perform jailbreaks and prompt injections. Should we continue to build defenses for this 1%, or 0.1% of users that can leverage these optimization problems with infinite computational power to break language models in this way? It’s a Sisyphean task, as the defender always is one step behind the attacker. No matter how many times you defend against an attack, someone will come up with a way around it until it’s provably robust, which as we mentioned earlier, is impossible.
As a security researcher, I’m not saying that we shouldn’t keep building defenses, in fact I think we need to continue to make our models as robust as realistically possible without compromising performance. But if we do pour tons of resources into defense against malicious users, it’s better to frame it as a way to further understand these models and how they work. I don’t think we should cite the fact that these models are exploitable by such a small subset of potential attackers as a reason not to deploy them, because then we never will. My takeaway is that we should focus on detection and quick responses to successful attacks, while making sure that we maintain the 99% robustness that’s more realistic.
If failure is so critical that we cannot afford to under any circumstances, then maybe we shouldn’t be deploying a language model in that setting in the first place. Although one could make the argument that a human might not be more robust than an LLM, given that phishing attacks are still some of the most common and successful hacking methods out there.
If you’ve read this far, thanks. Seriously. These are just a few of my opinions and don’t represent any organization that I may be a part of, but I hope I could provide some perspective.
Sources
- Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG).
- Anthropic (2024). Many-Shot Jailbreaking.
- Chao et al. (2024). Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR).
- Huang et al. (2025). IRIS: Stronger Universal and Transferable Attacks by Refusal Suppression.
- Anthropic (2025). The Life of a Jailbreak (Attribution graphs / biology).
- Zou et al. (2024). Improving Alignment and Robustness with Circuit Breakers.