Your agent's orchestration pattern matters more than the model behind it. Here's the data.

MAESTRO measured CRAG at 70.6% accuracy with $0.0010 per task and a tight latency distribution. Plan-and-Execute on comparable work: 48.3% accuracy, $0.0126 per task, latency IQR spanning an order of magnitude. Same backend models. The architecture decided everything.

The MAESTRO benchmark paper, published in January, contains one sentence that the rest of this year's agent infrastructure discourse has been working out: MAS (multi-agent system) architecture is the dominant driver of resource profiles, reproducibility, and cost-latency-accuracy trade-off, often outweighing changes in backend models or tool settings. The authors meant it as a finding from their 12-system benchmark. Read as a design directive, it inverts the usual order of operations a team goes through when building an agent. The orchestration pattern is upstream of the model choice. Most planning happens in the other order.

The numbers behind the sentence are unambiguous. With the same task class and similar backend models, MAESTRO measured CRAG (Corrective Retrieval Augmented Generation) at 70.6% median accuracy with a $0.0010 median cost per task and a 42.8-second median duration. The Plan-and-Execute pattern on comparable work landed at 48.3% accuracy, $0.0126 cost, and a duration interquartile range of 30.6 to 356.6 seconds, a spread of more than an order of magnitude within the IQR alone. Same task class. Same backend. The architecture decided everything.
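To make the gap concrete, the reported figures can be reduced to a few ratios. This is a back-of-the-envelope sketch: the inputs are the numbers quoted above, while the derived "cost per correct answer" is our own rough heuristic (median cost divided by accuracy), not a metric the paper reports.

```python
# Figures as quoted in the text above; everything derived below is our own arithmetic.
crag = {"accuracy": 0.706, "cost_usd": 0.0010}
plan_execute = {"accuracy": 0.483, "cost_usd": 0.0126, "iqr_s": (30.6, 356.6)}

# How much more Plan-and-Execute costs per task than CRAG.
cost_ratio = plan_execute["cost_usd"] / crag["cost_usd"]

# How wide the Plan-and-Execute latency IQR is, as a ratio of Q3 to Q1.
iqr_low, iqr_high = plan_execute["iqr_s"]
iqr_spread = iqr_high / iqr_low


def cost_per_correct(pattern):
    """Rough heuristic (an assumption, not from the paper): cost / accuracy."""
    return pattern["cost_usd"] / pattern["accuracy"]


correct_ratio = cost_per_correct(plan_execute) / cost_per_correct(crag)

print(f"cost per task:     {cost_ratio:.1f}x")
print(f"IQR spread:        {iqr_spread:.1f}x")
print(f"cost per correct:  {correct_ratio:.1f}x")
```

Folding accuracy into the cost widens the gap further, because the more expensive pattern is also the less accurate one on this task class.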

The pattern catalog, briefly

The agent ecosystem has converged on roughly four named patterns at the single-agent level, plus a handful of multi-agent compositions above them. The differences are real and well-documented enough to have entered LangGraph and other frameworks as prebuilt orchestration templates.

Figure: Four agent orchestration patterns and their trade-offs (single-agent orchestration patterns, LangGraph-native). Each card lists the loop shape and the typical use case.

  • ReAct: reason → act → observe (loop). The canonical agent loop. Default for single-agent work. Cheap when the task resolves in <10 steps; loses ground when the agent needs to revise its plan. Use: general-purpose first choice.
  • Reflexion: ReAct + self-evaluation + memory. Adds a critique step after each attempt and stores insights for future runs. More tokens per task; recovers from repeating failure modes that defeat ReAct. Use: repeating-failure recovery.
  • Plan-and-Execute: plan first → execute sequentially. Architectural opposite of ReAct. High planning cost up front, then linear execution. Sensitive to plan quality. Highest variance in MAESTRO measurements. Use: planning is the bottleneck.
  • CRAG: retrieve → self-grade → optional web fallback. Best MAESTRO numbers in its class: 70.6% accuracy, $0.0010 median cost, 42.8s median duration. Suited to knowledge bases of uneven quality. Use: corpus quality varies.

The patterns are not interchangeable. A team that picks Plan-and-Execute for a task that doesn't actually need a plan is paying the planning cost for no benefit. A team that picks ReAct for a task whose solutions repeat the same failure mode will keep hitting that failure mode. The patterns are the difference between paying the right cost and paying a different one for no return.

Why architecture dominates model choice

The MAESTRO finding is harder to dismiss than the usual methodology-paper-of-the-month because the data set is broad: 12 MAS across the LangGraph, Autogen, and ADK frameworks, controlled backend models, repeated runs. The structural reason the paper's conclusion holds is mechanical. The orchestration pattern determines how many LLM calls happen per task, how those calls compose, and what the context window contains at each step. The model determines the cost and quality of each individual call. The first variable multiplies across every call in a task; the second applies once per call. The first wins arithmetically in production.

Consider a concrete example. A ReAct loop that resolves in three think/act cycles produces three LLM calls plus three observation reads. A Plan-and-Execute loop on the same task produces one planning call (long, with full task context) plus N execution calls (short, with per-subtask context). The two patterns can reach the same answer, but their token bills differ by a factor of 3-10x depending on context structure. Swapping from Opus 4.7 to GPT-5.5 changes the per-call cost by 10-20%. Swapping from ReAct to Plan-and-Execute changes the call count by a factor of several.
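The arithmetic can be sketched as a small cost model. Everything below is illustrative: the `Call` type, the token counts, and the per-million-token prices are hypothetical values chosen to show the shape of the calculation, not measurements from MAESTRO or any provider's price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One LLM call, described only by its token counts (hypothetical type)."""
    input_tokens: int
    output_tokens: int


def task_cost(calls, usd_per_mtok_in, usd_per_mtok_out):
    """Dollar cost of one task: the sum over every LLM call the pattern makes."""
    total = sum(
        c.input_tokens * usd_per_mtok_in + c.output_tokens * usd_per_mtok_out
        for c in calls
    )
    return total / 1_000_000


# Hypothetical token counts, for illustration only.
# ReAct: three cycles, each re-reading a context that grows with observations.
react = [Call(2_000, 300), Call(3_500, 300), Call(5_000, 300)]
# Plan-and-Execute: one long planning call plus five short execution calls.
plan_execute = [Call(6_000, 1_500)] + [Call(1_500, 400)] * 5

PRICE_IN, PRICE_OUT = 3.0, 15.0  # hypothetical USD per million tokens

react_cost = task_cost(react, PRICE_IN, PRICE_OUT)
plan_cost = task_cost(plan_execute, PRICE_IN, PRICE_OUT)

# A model that is 15% cheaper scales every call by the same factor.
cheaper_react = task_cost(react, PRICE_IN * 0.85, PRICE_OUT * 0.85)

print(f"ReAct:                 ${react_cost:.4f}")
print(f"Plan-and-Execute:      ${plan_cost:.4f}")
print(f"ReAct, 15% cheaper:    ${cheaper_react:.4f}")
```

A price change moves the bill by exactly the price delta, whatever the pattern. A pattern change alters the number and size of the terms in the sum, so its effect is bounded only by how the context is structured.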

The implication is direct: a team that wants to halve its agent's cost should look at the orchestration pattern before it shops the model providers. The pattern is the variable they own. The provider sets the floor.

The reliability story sits on the same axis

MAESTRO's other headline finding, which we covered in February, was that the same MAS executed across repeated runs produced a Jaccard edge-set similarity of 0.86 (high) but a Longest Common Subsequence (LCS) similarity of 0.65 (moderate). The set of interactions is stable across runs. Their order is not. Some patterns are more affected than others. Plan-and-Execute, with its planning step that can choose different orderings on different runs, is the dominant contributor to LCS variance in the paper. CRAG, with its retrieval step bounded by the corpus, is far more stable.
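The difference between the two metrics is easy to see on a toy trace. The sketch below assumes each run is recorded as an ordered list of (source, target) interactions, and it normalizes LCS length by the longer sequence; both the trace format and the normalization are our assumptions, and the paper may define them differently.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lcs_length(x, y):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(x, y):
    """LCS length divided by the longer sequence (assumed normalization)."""
    longest = max(len(x), len(y))
    return lcs_length(x, y) / longest if longest else 1.0


# Two hypothetical runs of the same agent: same interactions, different order.
run_a = [("planner", "search"), ("planner", "fetch"), ("executor", "summarize")]
run_b = [("planner", "fetch"), ("planner", "search"), ("executor", "summarize")]

edge_similarity = jaccard(set(run_a), set(run_b))
order_similarity = lcs_similarity(run_a, run_b)

print(f"Jaccard (edge set): {edge_similarity:.2f}")
print(f"LCS (ordering):     {order_similarity:.2f}")
```

Swapping two steps leaves the edge set untouched but costs a third of the ordering score, which is the same shape as the gap between the two numbers reported above.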

Reliability variance is not the same kind of cost as token spend, but it shows up the same way in production. An agent whose tool-call order varies between runs is harder to debug, harder to instrument with OpenTelemetry spans, and harder to commit to a contract with downstream consumers. Stability is a feature the pattern delivers or doesn't, and the model doesn't fix it after the fact.

What this means for the registry layer

A consumer asking "is this agent reliable?" or "what does this agent cost per task?" is asking questions whose answers depend more on the orchestration pattern than on the LLM provider in the agent's name. A registry that publishes only the model identifier ("this server is backed by Claude Opus 4.7") is publishing the smaller variable. The larger variable is "this server uses CRAG with a 30-second retrieval budget and a 3-stage critic chain." That is the metadata a probe-driven Agenstry-style registry can surface, and the metadata the public registries currently elide.
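As a rough illustration of what such an entry could look like, here is a hypothetical record that carries the orchestration metadata alongside the model identifier. The field names and the JSON layout are invented for this sketch and do not describe any existing registry's schema; the values are the ones used in the example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Orchestration:
    """Hypothetical orchestration descriptor; field names are illustrative."""
    pattern: str
    retrieval_budget_s: int
    critic_stages: int


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical registry record: the model is one field among several."""
    model: str
    orchestration: Orchestration


entry = RegistryEntry(
    model="Claude Opus 4.7",
    orchestration=Orchestration(
        pattern="CRAG",
        retrieval_budget_s=30,
        critic_stages=3,
    ),
)

# asdict() recurses into nested dataclasses, so the record serializes directly.
entry_json = json.dumps(asdict(entry), indent=2)
print(entry_json)
```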

Two practical implications for builders integrating third-party agents:

  • A multi-agent system whose subagents use different orchestration patterns has different cost and reliability profiles per subagent. Treating "the system" as a single object hides the underlying variance. The MAESTRO methodology is the closest the field has to a way to surface it.
  • The OpenAI Agents SDK and similar frameworks abstract the pattern behind a default. That default is a pattern choice on the integrator's behalf. Knowing which one and why is the difference between accepting the framework's economics and optimizing for the task.

What we're watching

Three things, observable within the next two academic cycles:

  1. Whether a follow-up to MAESTRO covers Reflexion and Tree-of-Thoughts at the same level of detail. The current paper measured CRAG and Plan-and-Execute at high resolution; the other patterns appeared but with less measurement depth. A direct apples-to-apples comparison across all the common patterns would be the field's most useful single piece of measurement.
  2. Whether agent frameworks expose orchestration choice as a first-class deployment parameter. LangGraph already does in its prebuilt graphs; the OpenAI Agents SDK and others abstract it. The first widely adopted framework that lets a user A/B test ReAct against Plan-and-Execute on the same task without rewriting their agent will collapse the methodology cost the MAESTRO paper carried.
  3. Whether a public conformance probe for orchestration patterns emerges. A probe that runs the same task across an agent's declared pattern and measures the cost-accuracy-latency distribution would let registries publish the right signal — not "this agent uses ReAct" as a claim, but "this agent's actual behavior matches the ReAct profile we measured."
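The measurement half of such a probe is simple to sketch. The code below assumes the probe has already collected repeated runs of one task as (correct, cost, duration) records; the `Run` type and the summary keys are hypothetical, and the quartiles use Python's "inclusive" method, which may differ from whatever a real probe or the paper would use.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass(frozen=True)
class Run:
    """One probe run of a task (hypothetical record format)."""
    correct: bool
    cost_usd: float
    duration_s: float


def summarize(runs):
    """Reduce repeated runs to an accuracy, cost, and latency profile."""
    durations = [r.duration_s for r in runs]
    # quantiles(..., n=4) returns the three quartile cut points: Q1, Q2, Q3.
    q1, _, q3 = quantiles(durations, n=4, method="inclusive")
    return {
        "accuracy": sum(r.correct for r in runs) / len(runs),
        "median_cost_usd": median(r.cost_usd for r in runs),
        "duration_iqr_s": (q1, q3),
        "duration_iqr_ratio": q3 / q1,
    }


# Made-up runs, for illustration only.
runs = [
    Run(True, 0.001, 10.0),
    Run(True, 0.002, 20.0),
    Run(False, 0.001, 30.0),
    Run(True, 0.003, 40.0),
    Run(True, 0.001, 50.0),
]

profile = summarize(runs)
print(profile)
```

A registry could then compare this measured profile against the profile expected for the declared pattern, rather than taking the declaration at face value.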

The agent infrastructure conversation spends most of its column inches on model choice because that is the variable the LLM providers compete on visibly. The orchestration pattern is the variable the agent operator actually owns, and the variable the data says dominates the others. The next year of measurement will likely shift the conversation. The orchestration choice deserves the column inches.

