
The cost of running an agent has five inputs. Only one of them is on the LLM rate card.

Claude Opus 4.7 and GPT-5.5 publish token rates. Production agent infrastructure analyses put those rates at roughly 38 percent of total spend — the other 62 percent is observability, orchestration, memory, and integration overhead that no public rate card describes. The largest variance in agent cost is architectural, not provider-driven.

The cheap part of running an agent is the part most coverage quotes. Claude Opus 4.7 prices at $5 per million input tokens and $25 per million output tokens. GPT-5.5 runs $5/$30 on the same basis. Those rate cards are public, comparable, and easy to plug into a spreadsheet. They are also, according to multiple production-cost breakdowns shipping this month, roughly 38% of what a real production agent actually spends in infrastructure. The remaining 62% lives in orchestration, observability, memory, and integration layers that no single rate card describes.

That gap is structural, and it's the part of agent economics nobody quotes when comparing model providers.

Five inputs, only one in the rate card

[Figure: Production agent cost breakdown. A horizontal stacked bar showing the approximate cost share of the five infrastructure inputs for a production AI agent: LLM tokens 38%, observability and tracing 22%, orchestration runtime 18%, memory and vector storage 14%, tool and MCP overhead 8%.]

Only the pink segment maps to a published rate card. The four violet/gray segments are vendor- and deployment-specific.

THE FIVE INPUTS

  1. LLM tokens: input + output, tokenizer differences, prompt caching
  2. Observability: span ingest, retention, alerting
  3. Orchestration runtime: multi-agent dispatch, durable execution
  4. Memory and vector retrieval: pgvector, Redis sessions, Pinecone
  5. Tool and MCP overhead: schema tokens in context, MCP server hosting

Each of the five inputs has its own pricing model, its own vendor ecosystem, and its own variability across deployments. Token costs are public because the LLM providers compete on them visibly. The other four are quoted in usage units that don't compose easily: observability vendors charge per span or per trace, vector DBs per GB-month, orchestration runtimes per execution-second, MCP servers per request or per agent. A spreadsheet that estimates "agent cost" from the rate card alone is, in practice, undercounting by roughly 2.6x in production (1 / 0.38 at the 38% token share).
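The undercount follows directly from the shares in the chart. A minimal sketch, assuming the approximate percentages above hold for a given deployment (they are averages, not guarantees) and using a hypothetical $1,000 monthly token bill as the starting point:

```python
# Approximate cost shares from the chart above (fractions of total spend).
# These are the post's rough figures, not measurements of any one deployment.
COST_SHARES = {
    "llm_tokens": 0.38,
    "observability": 0.22,
    "orchestration": 0.18,
    "memory": 0.14,
    "tool_mcp_overhead": 0.08,
}


def estimate_total_from_tokens(token_spend_usd, shares=COST_SHARES):
    """Scale a rate-card token estimate up to a five-input estimate."""
    total = token_spend_usd / shares["llm_tokens"]
    return {name: round(total * share, 2) for name, share in shares.items()}


# Hypothetical input: $1,000/month of token spend taken from the rate card.
breakdown = estimate_total_from_tokens(1_000.0)
undercount_factor = sum(breakdown.values()) / breakdown["llm_tokens"]

for name, usd in breakdown.items():
    print(f"{name:>18}: ${usd:,.2f}")
print(f"rate-card-only estimate undercounts by about {undercount_factor:.1f}x")
```

The function name and the dollar figure are illustrative. The point is only that a token-only estimate has to be divided by the token share to approximate the full bill.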

Where the numbers come from

A few specific data points anchor the chart above. Multiple 2026 agent infrastructure analyses quote 62% of infrastructure cost falling outside the model API, on observability, orchestration, memory, and integration. Production agent operating budgets range from $3,200 to $13,000 per month per agent serving real users, with the LLM API fraction typically a minority of the total. Three-year total cost of ownership analyses put initial development at 25–35% of three-year spend, with the remaining 65–75% in ongoing operation.

Two specific token-pricing wrinkles worth flagging because they distort the rate-card comparison further:

  • Claude Opus 4.7 ships with a new tokenizer that can produce up to 35% more tokens for the same input text as prior Claude models. The per-token rate is unchanged from Opus 4.6; the per-request cost can rise without any visible price change.
  • The conventional MCP server schema-loading pattern carries a substantial in-prompt token tax we covered last week. An agent connected to a dozen MCP servers can pay 30,000–50,000 tokens of context per prompt for the privilege of seeing the tool catalog. That cost is amortized across many requests for a long-lived agent session, but it shows up in the per-token line even though it is fundamentally an integration overhead.
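Both wrinkles can be put into the same per-request arithmetic. The sketch below assumes the $5 per million input rate quoted above, a hypothetical 2,000-token prompt, the worst-case 35% tokenizer inflation, and the low end (30,000 tokens) of the tool-catalog range. It ignores prompt caching, which would lower the schema cost on repeated requests.

```python
OPUS_INPUT_USD_PER_MTOK = 5.00  # input rate quoted in this post


def input_cost_usd(prompt_tokens, schema_tokens=0, tokenizer_inflation=0.0,
                   rate_per_mtok=OPUS_INPUT_USD_PER_MTOK):
    """Input-side cost of one request, in USD.

    tokenizer_inflation is the fractional increase in token count for the
    same text (0.35 = 35% more tokens). schema_tokens is the tool catalog
    carried in context, taken as already counted in tokens.
    """
    billed_tokens = prompt_tokens * (1 + tokenizer_inflation) + schema_tokens
    return billed_tokens * rate_per_mtok / 1_000_000


# Hypothetical 2,000-token prompt (an assumption, not a figure from the post).
baseline = input_cost_usd(2_000)
inflated = input_cost_usd(2_000, tokenizer_inflation=0.35)
with_catalog = input_cost_usd(2_000, schema_tokens=30_000,
                              tokenizer_inflation=0.35)

print(f"rate-card view:      ${baseline:.4f}")
print(f"with new tokenizer:  ${inflated:.4f}")
print(f"plus MCP catalog:    ${with_catalog:.4f}")
```

Under these assumptions the per-token rate never changes, yet the input cost of the request moves from about one cent to about sixteen cents.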

The combined effect is that "this model costs $5/M input" is a true sentence that under-specifies what a deployment will pay.

What this means for registries

Most public registries cataloging MCP servers and agents do not publish cost-per-call data. The omission mirrors the one we've argued against at the conformance layer: claims about behavior are cheap, evidence is expensive, and the consumer's decision depends on the latter.

A consumer evaluating two MCP servers with overlapping coverage will want to know which one consumes more context tokens per call. A consumer choosing between Code Mode-style code-execution agents and traditional MCP-tool-call agents needs the cost breakdown across all five inputs, not just the token rate. Agenstry's funnel describes the conformance shape of a server's behavior. The parallel work for cost economics would describe its observed cost shape: median token consumption per call, observability output volume, orchestration overhead per request, integration burden per session. Each is measurable. None is yet published at the registry layer at population scale.
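One way to picture an observed cost shape is as a small record computed from call samples. This is a sketch only: the `CostShape` fields, the `summarize` helper, and the sample values are all hypothetical, and no registry is known to publish such a record today.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class CostShape:
    """Hypothetical per-server cost record; field names are illustrative."""
    server: str
    median_tokens_per_call: float
    median_spans_per_call: float
    median_orchestration_ms: float
    schema_tokens_per_session: int


def summarize(server, calls, schema_tokens_per_session):
    """Reduce observed calls (dicts with tokens, spans, orchestration_ms)."""
    if not calls:
        raise ValueError("need at least one observed call")
    return CostShape(
        server=server,
        median_tokens_per_call=median(c["tokens"] for c in calls),
        median_spans_per_call=median(c["spans"] for c in calls),
        median_orchestration_ms=median(c["orchestration_ms"] for c in calls),
        schema_tokens_per_session=schema_tokens_per_session,
    )


# Made-up observations for a made-up server.
observed = [
    {"tokens": 900, "spans": 4, "orchestration_ms": 120},
    {"tokens": 1100, "spans": 6, "orchestration_ms": 95},
    {"tokens": 5000, "spans": 5, "orchestration_ms": 400},
]
shape = summarize("example-server", observed, schema_tokens_per_session=5_000)
print(shape)
```

Medians are used here because a single outlier call (the 5,000-token one above) would otherwise dominate a mean. A real registry would likely publish percentiles as well.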

Implications for agent platform teams

A team building or operating production agents has a different cost profile than the marketing copy suggests. The implications follow directly from the breakdown:

  • The model choice matters less than the rate-card comparison assumes. Switching from Opus 4.7 to GPT-5.5 changes the largest single line item by maybe 10–20%. Switching from a tool-catalog MCP integration to a Code Mode-style runtime can change context-token consumption by as much as 99.9%. The architectural decisions usually dominate.
  • Observability is a fixed cost the rate card never warned you about. OpenTelemetry GenAI conventions are now the substrate for that cost; the vendor charges roll up to it. Budgeting for it as a first-class line item is how teams stay surprised once instead of monthly.
  • Memory and vector storage are the line items that scale with the number of users and the state stored for them, not with request volume alone. A 10x increase in active users produces roughly a 10x increase in vector-store spend, and that spend persists whatever happens to LLM token spend.
  • Tool/MCP overhead is the easiest line to reduce. It's also the line nobody on the team owns by default. The right group to optimize this is whoever owns the registry/gateway layer, which is why getting that layer named is part of the cost story.
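The first bullet's 10–20% figure can be checked against the rates quoted earlier. A rough sketch, assuming both models consume the same number of tokens for the same work (they will not exactly, given tokenizer differences), a hypothetical 75/25 split between input and output tokens, and the 38% token share from the chart:

```python
# (input, output) USD per million tokens, as quoted in this post.
RATES = {"opus-4.7": (5.0, 25.0), "gpt-5.5": (5.0, 30.0)}


def blended_rate(model, input_share):
    """Average USD per million tokens for a given input/output token mix."""
    input_rate, output_rate = RATES[model]
    return input_rate * input_share + output_rate * (1 - input_share)


def swap_impact(input_share, token_share_of_total=0.38):
    """Fractional change in token spend and in total spend after a swap."""
    before = blended_rate("opus-4.7", input_share)
    after = blended_rate("gpt-5.5", input_share)
    token_change = (after - before) / before
    return token_change, token_change * token_share_of_total


# Hypothetical mix: 75% of tokens are input, 25% are output.
token_change, total_change = swap_impact(input_share=0.75)
print(f"token line item changes by {token_change:.1%}")
print(f"total spend changes by     {total_change:.2%}")
```

With that mix the token line moves by 12.5% and the total by under 5%, which is small next to an architectural change that removes most of the tool-catalog context.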

The headline cost figures will keep coming from the LLM providers because that is the line they own and the line they compete on. The deployment-shape cost figures will keep diverging from the headline because they are functions of architecture, not of model pricing. The next year of agent infrastructure conversation will probably spend more time on this gap than on any single rate-card change.

What we're watching

Three things, observable within the next two quarters:

  1. Whether a public infrastructure-cost benchmark emerges for production agents. MAESTRO's per-task cost numbers ($0.0010 for CRAG, $0.0126 for Plan-and-Execute) are the closest thing the field has to a normalized comparison today. A benchmark that adds the five-input breakdown and tracks all of them over time would be a real public good.
  2. Whether OpenTelemetry GenAI Semantic Conventions absorb a cost-attribution attribute family. A standardized gen_ai.cost.usd span attribute would let observability vendors normalize the rate-card information into the same trace they already collect. Today, cost attribution is a separate spreadsheet from the trace.
  3. Whether the public MCP registries publish per-server context-token consumption. A consumer choosing between two functionally equivalent MCP servers would benefit from knowing that one costs 5,000 tokens of context and the other costs 50,000. That difference compounds across every prompt; it is the largest underpublished signal at the registry layer today.
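To make the second item concrete, here is a sketch of what a span's attributes could look like with cost attached. `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens` are existing GenAI semantic-convention attribute names. `gen_ai.cost.usd` is the hypothetical attribute discussed above and is not part of the conventions today. The model identifiers and the rate table are assumptions built from the prices quoted in this post, and the code builds a plain dictionary rather than calling any tracing SDK.

```python
# (input, output) USD per million tokens; keys are illustrative model labels.
RATE_CARD_USD_PER_MTOK = {
    "claude-opus-4.7": (5.0, 25.0),
    "gpt-5.5": (5.0, 30.0),
}


def cost_attributes(model, input_tokens, output_tokens):
    """Build span attributes for one LLM call, including a derived cost."""
    input_rate, output_rate = RATE_CARD_USD_PER_MTOK[model]
    cost = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical attribute: not in the GenAI conventions today.
        "gen_ai.cost.usd": round(cost, 6),
    }


# Hypothetical call: 32,700 input tokens (prompt plus tool catalog), 800 output.
attrs = cost_attributes("claude-opus-4.7", 32_700, 800)
for key, value in attrs.items():
    print(f"{key} = {value}")
```

In a real pipeline these key-value pairs would be set on the span that already records the call, so the cost would travel with the trace instead of living in a separate spreadsheet.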

The rate-card comparison between Opus 4.7 and GPT-5.5 will keep appearing in benchmarks. The cost economics for the agent built on top of either of them will keep diverging from that comparison by a factor most teams discover after they're already in production. The registry layer is the obvious place to expose that divergence before it surprises someone.

