PromptArmor scores 99% on AgentDojo. AgentDyn says no defense works yet.

PromptArmor's sub-one-percent false-positive and false-negative rates on AgentDojo got the conference headlines. A paper published five months later, AgentDyn, finds that no tested defense reaches acceptable real-world performance on longer, more dynamic agent workloads. The gap between the two results is the actual state of the field.

The headline result in agent prompt-injection defense, as of the ICLR 2026 conference cycle, belongs to PromptArmor: false-positive and false-negative rates both under one percent on the AgentDojo benchmark, with attack success dropping below one percent after the defense removes injected prompts. Read in isolation, that result reads like a solved problem. Read alongside a paper published five months later, AgentDyn from Hao Li, Ruoyao Wen, Shanghao Shi, Ning Zhang, and Chaowei Xiao, it reads like an artefact of the benchmark, not the defense.

AgentDyn's headline claim is direct: "none of [the defenses] attain acceptable performance for real-world deployment on AgentDyn." Defenses that score near-perfectly on AgentDojo fail on AgentDyn's task set. The gap between those two results is the entire story of where prompt-injection defense actually stands.

The two benchmarks are not measuring the same thing

The most useful way to read the AgentDyn paper is by sorting its critique of AgentDojo into the structural differences between the two benchmarks. AgentDojo formalized the multi-turn tool-use threat model and let researchers iterate on defenses; PromptArmor's result is real within that scope. AgentDyn's authors argue, with task-level evidence, that the scope has moved.

[Figure: Prompt-injection benchmark comparison, AgentDojo vs AgentDyn. AgentDojo: 97 tasks; average trajectory length 3 steps; 6 of 97 tasks require dynamic planning; best result PromptArmor (GPT-4o), FPR 0.07%, FNR 0.23%. AgentDyn: 60 tasks with 560 injection cases; average trajectory length 7.1 steps; 60 of 60 tasks require dynamic planning; average 33 tools per task; across all tested defenses, "none attain acceptable performance."]

The numbers behind the AgentDyn critique are concrete. AgentDojo has 97 tasks, of which 6 require dynamic planning. AgentDyn has 60 tasks, all of which require dynamic planning; its average trajectory length is 7.1 steps versus AgentDojo's 3, and it puts an average of roughly 33 tools in scope per task. A defense optimized for an environment where most tasks complete in three steps and don't require the agent to revise its plan is not necessarily a defense for an environment where every task requires sustained planning across seven steps with thirty-plus tools in scope.

Why the gap is not a methodology nitpick

It would be tempting to read the AgentDyn paper as a benchmark churn-cycle artefact: a new benchmark is harder than the old one, the old defenses underperform, that's how the field progresses. What makes this case different is that production agent deployments resemble AgentDyn's parameters more than AgentDojo's. Uber's MCP gateway runs 60,000 weekly executions across 10,000-plus internal services; the average agent working in that environment is not finishing a task in three steps with two tools in scope. The OpenTelemetry GenAI Semantic Conventions make agent_run spans first-class precisely because production agent executions are long-lived, multi-step, multi-tool objects.

A defense whose benchmark performance is excellent on a structurally simpler workload than the one a production system actually runs is not yet a production defense. AgentDyn's authors put the point directly: "existing defenses are either not secure enough or suffer from significant over-defense." Over-defense is the lesser-discussed failure mode and the one that matters operationally. A defense that catches 99% of injections but blocks 20% of legitimate tool calls is unusable in a production loop. AgentDyn measures that trade-off; AgentDojo, with its three-step tasks, has less room to surface it.
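
A back-of-the-envelope sketch of why trajectory length changes how much over-defense a benchmark can surface. It assumes, purely for illustration, that each step makes one legitimate tool call, that the defense wrongly blocks each call independently with the same probability, and that a single wrongly blocked call fails the task. None of these assumptions come from the AgentDyn paper, and the function name is hypothetical.

```python
def clean_completion_rate(false_block_rate: float, steps: int) -> float:
    """Probability that no legitimate tool call is blocked in one task.

    Illustrative model only: one tool call per step, with independent
    and identically distributed false blocks. Real defenses will differ.
    """
    if not 0.0 <= false_block_rate <= 1.0:
        raise ValueError("false_block_rate must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - false_block_rate) ** steps


if __name__ == "__main__":
    # 20% is the hypothetical over-defense rate from the paragraph above;
    # 3 and 7 approximate the two benchmarks' average trajectory lengths.
    for steps in (3, 7):
        rate = clean_completion_rate(0.20, steps)
        print(f"{steps} steps: {rate:.1%} of benign tasks finish unblocked")
```

Under those assumptions, the same 20% false-block rate lets roughly half of three-step tasks through untouched but only about a fifth of seven-step tasks, which is one way to see why a short-trajectory benchmark can understate the operational cost of over-defense.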

What this means for registries and gateways

The Agenstry-relevant question is what a downstream consumer of an MCP server or A2A agent should believe when a vendor says "prompt-injection defense in place." The honest answer today is that the claim is underspecified without a named benchmark. The same defense reaches one score on AgentDojo and a different score on AgentDyn, and the two scores describe operationally different products.

A registry that records "this server claims prompt-injection defense" is encoding a marketing string. A registry that records "this server, when probed against the AgentDojo task set, scored X; against the AgentDyn task set, scored Y" is encoding a measurement. The split is the same one this blog keeps coming back to: the claim is one object, the evidence is another, and the consumer needs to know which they are reading.
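
The claim-versus-evidence split can be sketched as a data structure. This is a minimal illustration, not a real registry schema: every class and field name below is hypothetical, and the numbers in the usage example are placeholders, not published results.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    """One measured result. All field names are hypothetical."""
    benchmark: str                 # e.g. "AgentDojo" or "AgentDyn"
    attack_success_rate: float     # fraction of injections that succeeded
    benign_task_completion: float  # fraction of clean tasks completed


@dataclass
class DefenseEntry:
    """Registry entry that keeps the claim and the evidence apart."""
    server: str
    vendor_claim: Optional[str] = None  # the marketing string, if any
    evidence: List[BenchmarkResult] = field(default_factory=list)

    def is_measured(self) -> bool:
        # True only when at least one named-benchmark result is attached.
        return bool(self.evidence)

    def result_for(self, benchmark: str) -> Optional[BenchmarkResult]:
        # Return the first result recorded for the named benchmark.
        for result in self.evidence:
            if result.benchmark == benchmark:
                return result
        return None


if __name__ == "__main__":
    entry = DefenseEntry(
        server="example-mcp-server",
        vendor_claim="prompt-injection defense in place",
        # Placeholder numbers for illustration, not real benchmark scores.
        evidence=[BenchmarkResult("AgentDojo", 0.01, 0.95)],
    )
    print(entry.is_measured(), entry.result_for("AgentDyn"))
```

A consumer reading such an entry can tell at a glance whether it holds only a vendor string or a result tied to a named benchmark, and a missing benchmark shows up as an explicit gap.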

The same observation applies one layer up to enterprise gateways like the one Uber built. A gateway that ships with a "PromptArmor-style" defense by default has a defense for the AgentDojo threat model. Whether that defense holds against the workload of a long-running, multi-tool agent traversing the gateway is a separate empirical question. The methodology critique AgentDyn makes is the same one a serious gateway operator should be running internally before treating any single defense as load-bearing.

What we're watching

Three things, observable within the next two academic cycles:

  1. Whether a PromptArmor-class defense closes the AgentDyn gap. The PromptArmor design (an off-the-shelf LLM that detects and removes injected prompts before the agent processes them) is agnostic to task complexity. Re-running the same defense against AgentDyn, and publishing the result, is the obvious next experiment. If the gap closes, the field has a working defense. If it doesn't, the field has a more honest problem.
  2. Whether a new benchmark splits out the over-defense axis explicitly. AgentDyn names over-defense as a failure mode, and AgentDojo's short tasks make it hard to surface. A benchmark that scored defenses on a two-dimensional surface (attack success rate under attack crossed with task completion rate without attack) would be the clearest version of the methodology AgentDyn is pointing toward.
  3. Whether the OpenTelemetry GenAI Semantic Conventions absorb a prompt-injection-attempt attribute. A standard span attribute like gen_ai.security.prompt_injection_detected would give production deployments a way to surface defense triggers as first-class telemetry rather than as ad-hoc logs. Once that attribute exists, the population-scale answer to "how often is prompt injection caught in production" becomes computable.
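
The two-dimensional scoring described in the second item could look something like the sketch below. The trial format and function name are hypothetical and are not taken from AgentDojo or AgentDyn; the point is only that the two axes are computed from disjoint sets of runs and reported side by side instead of being folded into one number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """Outcome of one benchmark run. Field names are hypothetical."""
    under_attack: bool      # was an injection present in this run?
    attack_succeeded: bool  # meaningful only when under_attack is True
    task_completed: bool    # did the agent finish the user's task?


def two_axis_score(trials):
    """Return (attack_success_rate, benign_completion_rate).

    Attack success is measured only on attacked runs, task completion
    only on clean runs. Either value is None when no such runs exist.
    """
    attacked = [t for t in trials if t.under_attack]
    benign = [t for t in trials if not t.under_attack]
    attack_success_rate = (
        sum(t.attack_succeeded for t in attacked) / len(attacked)
        if attacked else None
    )
    benign_completion_rate = (
        sum(t.task_completed for t in benign) / len(benign)
        if benign else None
    )
    return attack_success_rate, benign_completion_rate
```

A defense that pushes the first number toward zero by blocking everything would show up immediately as a collapse in the second, which is the over-defense failure a single headline score can hide.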

A 99% score on a benchmark is the kind of headline that gets cited in slide decks for the next year. The smaller, less-cited paper saying the benchmark is the easy part is the one a production gateway should be reading. The next round of defenses will be measured against the harder workload by definition; the architecture that wins there probably isn't the architecture that won on the easier one.
