LLM Agents in Production: What Actually Breaks

I've been building and deploying LLM agent systems in enterprise environments for the past two years. The failure modes in production are almost never what the demos suggest they might be. Demos fail gracefully and recover elegantly. Production fails in ways that are boring, expensive, and hard to debug.

Here's what actually breaks when you move LLM agents out of the notebook and into production.

Tool Calling Reliability

The thing that breaks most often and most expensively is tool calling. Not because the LLM calls the wrong tool — though that happens — but because the tools themselves aren't designed to be called by an LLM.

Human-facing APIs are designed with the assumption that a human is reading the error message. "Error 422: The field 'account_number' must be exactly 10 digits" makes sense to a developer looking at their screen. To an LLM, that error message is an instruction — and the LLM will often try to solve it by guessing what a 10-digit account number might look like, rather than asking the user for clarification or halting.

LLM-compatible tools need explicit contracts: structured input schemas, structured error responses that describe what went wrong in a way that's actionable for a model, idempotency for operations that might be retried, and rate limiting that accounts for an agent looping. These requirements are different from what you'd build for a human-facing UI or a synchronous API integration.

Context Window Management

Long-running agents accumulate context. Every tool call, every response, every iteration adds tokens to the context window. Eventually you hit the limit — or you hit the cost ceiling of a 100K+ token context window being processed on every LLM call in the loop.

Most agent frameworks don't handle this gracefully out of the box. The naive implementation drops the oldest messages when context gets too long — which frequently drops the instructions that told the agent what its goal was. Suddenly the agent has no memory of why it's running.

Production agent systems need explicit context management strategies: summarization of completed steps, structured memory stores for persistent state outside the context window, and explicit handling for the case where the agent has exceeded the context budget and needs to checkpoint and restart.

Prompt Injection via Tool Returns

If your agent retrieves documents, reads emails, searches the web, or processes any user-supplied content, it is vulnerable to prompt injection. Malicious content in the retrieval results can instruct the agent to ignore its original instructions and take a different action.

This is not a theoretical concern. I've seen production agent systems where a maliciously crafted document in a retrieval corpus caused the agent to exfiltrate data by embedding it in an outbound API call. The agent was following instructions — just not the instructions the operator intended.

Mitigations include: treating all retrieved content as untrusted and applying sandboxing, using separate context windows for instructions and retrieved data where the framework supports it, adding explicit instructions to the system prompt about ignoring injected instructions in retrieved content, and auditing all outbound actions the agent takes before they execute for sensitive operations.

Non-Determinism at Scale

LLMs are probabilistic. Given the same input, they produce slightly different outputs across calls. In demos this looks like creativity. In production workflows this is a correctness problem.

An agent that classifies incoming support tickets will occasionally misclassify. An agent that extracts data from contracts will occasionally miss a field or extract the wrong value. These errors compound in multi-step pipelines — a misclassification in step one changes the tool calls in step two, which changes the output in step three.

Production agent systems need evaluation pipelines that measure error rates on representative workloads before deployment, regression testing when model versions change (and model providers change model versions more often than they announce), and human-in-the-loop review for high-stakes decisions. The agent should know what it doesn't know and route uncertain cases to human review rather than guessing.

Observability

When a human-written function produces wrong output, you have a call stack, you have logs, you have a deterministic chain of execution to trace. When an LLM agent produces wrong output, you have a sequence of tokens and a series of tool calls. The gap between those two debugging experiences is significant.

Good agent observability requires logging every LLM call with its full prompt and response, logging every tool call with its inputs and outputs, tracking token usage per step, and measuring task completion rates and error rates at the aggregate level. Without this, you cannot debug failures, you cannot attribute costs, and you cannot evaluate whether a model upgrade improved or degraded performance.

LangSmith, Langfuse, and Weights & Biases all provide agent tracing infrastructure. If you're building production agents and you're not instrumenting them, you're flying blind.

The Pattern That Survives Contact With Production

The agent architectures that hold up in production share a few characteristics: they have clearly bounded scopes (an agent that does one thing well rather than an agent that can theoretically do anything), they treat every external interaction as potentially failing, they have explicit recovery paths for failure cases, and they surface uncertainty rather than resolving it silently.

The most reliable production agents I've built look less like autonomous reasoning systems and more like well-structured workflows with LLM decision nodes at specific steps. The intelligence is targeted. The blast radius of a bad decision is contained. The human is in the loop for anything irreversible.

That's less exciting than the AGI-adjacent demos that go viral. It's also what actually ships and stays in production.