PromptArmor scores 99% on AgentDojo. AgentDyn says no defense works yet.

The headline result in agent prompt-injection defense, as of the ICLR 2026 conference cycle, belongs to PromptArmor: false-positive and false-negative rates both under one percent on the AgentDojo benchmark, with attack success dropping below one percent after the defense removes injected prompts. Taken in isolation, that result looks like a solved problem. Read alongside AgentDyn, a paper posted to arXiv in February 2026 by Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, and Chaowei Xiao, it looks like an artefact of the benchmark, not of the defense.

AgentDyn's headline claim is direct: "none of [the defenses] attain acceptable performance for real-world deployment on AgentDyn." The same defenses that benchmark near perfectly on AgentDojo fail on AgentDyn's task set. The gap between those two results is the entire story of where prompt-injection defense actually stands.

The two benchmarks are not measuring the same thing

The most useful way to read the AgentDyn paper is by sorting its critique of AgentDojo into the structural differences between the two benchmarks. AgentDojo formalized the multi-turn tool-use threat model and let researchers iterate on defenses; PromptArmor's result is real within that scope. AgentDyn's authors argue, with task-level evidence, that the scope has moved.

AgentDojo

Tasks: 97
Avg trajectory length: 3 steps
Dynamic-planning tasks: 6 / 97
Best result: PromptArmor (GPT-4o), FPR 0.07%, FNR 0.23%

AgentDyn

Tasks: 60 + 560 injections
Avg trajectory length: 7.1 steps
Dynamic-planning tasks: 60 / 60
Best result: across all tested defenses, "none attain acceptable performance"

The numbers behind the AgentDyn critique are concrete. AgentDojo has 97 tasks, of which 6 require dynamic planning. AgentDyn has 60 tasks, all of which require dynamic planning; its average trajectory runs 7.1 steps against AgentDojo's 3, with an average of roughly 33 tools per task. A defense optimized for an environment where most tasks complete in three steps without the agent ever revising its plan is not necessarily a defense for an environment where every task requires sustained planning across seven steps with thirty-plus tools in scope.
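To make the structural gap easier to compare at a glance, the figures quoted above can be put into a small Python sketch. The class and function names are illustrative and come from neither paper; only the task counts and trajectory lengths are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkProfile:
    """Headline structure of a benchmark, using only figures quoted in this post."""
    name: str
    tasks: int
    dynamic_planning_tasks: int
    avg_trajectory_steps: float

    @property
    def dynamic_share(self) -> float:
        # Fraction of tasks that require the agent to revise its plan mid-run.
        return self.dynamic_planning_tasks / self.tasks


AGENTDOJO = BenchmarkProfile("AgentDojo", tasks=97, dynamic_planning_tasks=6,
                             avg_trajectory_steps=3.0)
AGENTDYN = BenchmarkProfile("AgentDyn", tasks=60, dynamic_planning_tasks=60,
                            avg_trajectory_steps=7.1)


def trajectory_ratio(a: BenchmarkProfile, b: BenchmarkProfile) -> float:
    """How many times longer the average trajectory of `a` is than that of `b`."""
    return a.avg_trajectory_steps / b.avg_trajectory_steps


if __name__ == "__main__":
    for profile in (AGENTDOJO, AGENTDYN):
        print(f"{profile.name}: {profile.dynamic_share:.1%} dynamic-planning tasks")
    print(f"Trajectory ratio: {trajectory_ratio(AGENTDYN, AGENTDOJO):.2f}x")
```

On these figures, about 6% of AgentDojo tasks require dynamic planning against 100% of AgentDyn tasks, and the average AgentDyn trajectory is a little over twice as long.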

Why the gap is not a methodology nitpick

It would be tempting to read the AgentDyn paper as a benchmark churn-cycle artefact: a new benchmark is harder than the old one, the old defenses underperform, that's how the field progresses. What makes this case different is that production agent deployments resemble AgentDyn's parameters more than AgentDojo's. Uber's MCP gateway runs 60,000 weekly executions across 10,000-plus internal services; the average agent working in that environment is not finishing a task in three steps with two tools in scope. The OpenTelemetry GenAI Semantic Conventions make agent_run spans first-class precisely because production agent executions are long-lived, multi-step, multi-tool objects.

A defense whose benchmark performance is excellent on a structurally simpler workload than the one a production system actually runs is not yet a production defense. AgentDyn's authors put the point directly: "existing defenses are either not secure enough or suffer from significant over-defense." Over-defense is the less-discussed failure mode and the one that matters operationally. A defense that catches 99% of injections but blocks 20% of legitimate tool calls is unusable in a production loop. AgentDyn measures that trade-off; AgentDojo, with its three-step tasks, has less room to surface it.
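A rough way to see why trajectory length matters for over-defense is to compound the per-call block rate across a task. The sketch below rests on simplifying assumptions that come from neither paper: one tool call per step, each legitimate call blocked independently, and any single block aborting the task. The 20% rate is the illustrative figure used above, and 7 steps stands in for AgentDyn's 7.1 average.

```python
def clean_completion_rate(per_call_block_rate: float, steps: int) -> float:
    """Probability that a benign task finishes when every step must pass the defense.

    Assumes (for illustration only) one tool call per step, independent
    blocking decisions, and that one blocked call aborts the whole task.
    """
    if not 0.0 <= per_call_block_rate <= 1.0:
        raise ValueError("per_call_block_rate must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_call_block_rate) ** steps


if __name__ == "__main__":
    for steps in (3, 7):
        rate = clean_completion_rate(0.20, steps)
        print(f"{steps} steps: {rate:.1%} of benign tasks complete")
```

Under those assumptions, roughly half of three-step tasks still complete (0.8^3 = 0.512), while only about a fifth of seven-step tasks do (0.8^7 ≈ 0.21). Real defenses will not behave this cleanly, but the direction of the effect is the point: the same per-call false-positive rate costs far more on a longer trajectory.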

What this means for registries and gateways

The Agenstry-relevant question is what a downstream consumer of an MCP server or A2A agent should believe when a vendor says "prompt-injection defense in place." The honest answer today is that the claim is underspecified without a named benchmark. The same defense reaches one score on AgentDojo and a different score on AgentDyn, and the two scores describe operationally different products.

A registry that records "this server claims prompt-injection defense" is encoding a marketing string. A registry that records "this server, when probed against the AgentDojo task set, scored X; against the AgentDyn task set, scored Y" is encoding a measurement. The split is the same one this blog keeps coming back to: the claim is one object, the evidence is another, and the consumer needs to know which they are reading.
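One way a registry might keep the two objects apart is sketched below. The schema is hypothetical: the class names, field names, and summary strings are assumptions made for illustration and do not describe any existing registry format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class DefenseClaim:
    """What the vendor says. Free text, unverified."""
    statement: str


@dataclass(frozen=True)
class DefenseMeasurement:
    """What a probe observed against one named benchmark."""
    benchmark: str
    attack_success_rate: float      # fraction of injections that succeeded
    benign_completion_rate: float   # fraction of clean tasks that still finished


@dataclass
class RegistryEntry:
    """Hypothetical registry record that stores claim and evidence separately."""
    server_id: str
    claim: Optional[DefenseClaim] = None
    measurements: List[DefenseMeasurement] = field(default_factory=list)

    def evidence_for(self, benchmark: str) -> Optional[DefenseMeasurement]:
        # Return the measurement for a named benchmark, or None if never probed.
        for m in self.measurements:
            if m.benchmark == benchmark:
                return m
        return None

    def summary(self) -> str:
        # Tell the consumer which kind of object they are reading.
        if not self.measurements:
            return f"{self.server_id}: claim only, no measurement on record"
        names = ", ".join(sorted(m.benchmark for m in self.measurements))
        return f"{self.server_id}: measured on {names}"
```

An entry built with only a `DefenseClaim` summarizes as "claim only", and `evidence_for("AgentDyn")` returns `None` until someone has actually run that probe. Keeping the benchmark name inside each measurement is what stops a score from one task set being read as a score on another.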

The same observation applies one layer up to enterprise gateways like the one Uber built. A gateway that ships with a "PromptArmor-style" defense by default has a defense for the AgentDojo threat model. Whether that defense holds against the workload of a long-running, multi-tool agent traversing the gateway is a separate empirical question. The methodology critique AgentDyn makes is the same one a serious gateway operator should be running internally before treating any single defense as load-bearing.

What we're watching

Three things, observable within the next two academic cycles:

1. Whether a PromptArmor-class defense closes the AgentDyn gap. The PromptArmor architecture (an off-the-shelf LLM that detects and removes injected prompts before the agent processes them) is, by design, agnostic to task complexity. Re-running the same defense against AgentDyn, and publishing the result, is the obvious next experiment. If the gap closes, the field has a working defense. If it doesn't, the field has a more honest problem.
2. Whether a new benchmark splits out the over-defense axis explicitly. AgentDyn names over-defense as a failure mode, and AgentDojo's short tasks make it hard to surface. A benchmark that scored defenses on a two-dimensional surface (attack success rate under attack, crossed with task completion rate without attack) would be the clearest version of the methodology AgentDyn is pointing toward.
3. Whether the OpenTelemetry GenAI Semantic Conventions absorb a prompt-injection-attempt attribute. A standard span attribute like gen_ai.security.prompt_injection_detected would give production deployments a way to surface defense triggers as first-class telemetry rather than as ad-hoc logs. Once that attribute exists, the population-scale answer to "how often is prompt injection caught in production" becomes computable.
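The two-dimensional scoring idea can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the scoring code of either benchmark: the `Run` record and its field names are invented here, and the sketch simply reports the two axes side by side instead of collapsing them into one number.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Run:
    """One benchmark run with the defense enabled (hypothetical record shape)."""
    under_attack: bool       # was an injection present in this run?
    attack_succeeded: bool   # only meaningful when under_attack is True
    task_completed: bool     # did the agent finish the user's task?


def two_axis_score(runs: Iterable[Run]) -> Tuple[float, float]:
    """Return (attack success rate under attack, task completion rate without attack)."""
    runs = list(runs)
    attacked = [r for r in runs if r.under_attack]
    clean = [r for r in runs if not r.under_attack]
    if not attacked or not clean:
        raise ValueError("need at least one attacked run and one clean run")
    attack_success_rate = sum(r.attack_succeeded for r in attacked) / len(attacked)
    clean_completion_rate = sum(r.task_completed for r in clean) / len(clean)
    return attack_success_rate, clean_completion_rate
```

Reporting the pair keeps an over-defending system visible: a defense that drives the first number to zero by blocking most tool calls shows up immediately as a collapse in the second.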

A 99% score on a benchmark is the kind of headline that gets cited in slide decks for the next year. The smaller, less-cited paper saying the benchmark is the easy part is the one a production gateway should be reading. The next round of defenses will be measured against the harder workload by definition; the architecture that wins there probably isn't the architecture that won on the easier one.

Sources

PromptArmor: Simple yet Effective Prompt Injection Defenses — Shi, Zhu, et al., arXiv:2507.15219, ICLR 2026.
AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System — Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, Chaowei Xiao, arXiv:2602.03117, February 2026.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents — AgentDojo benchmark page on Inspect AISI evals, accessed May 2026.
Semantic Conventions for Generative AI Systems — OpenTelemetry, accessed May 2026.
