An AI agent is, at minimum, an LLM that can use tools and take actions in the world — send emails, query databases, call APIs, write code, execute commands. That description makes agents sound straightforward. The reality is that every capability you give an agent is also a new failure mode.
We've deployed 12 agentic systems in production: a compliance document analyst, a trading alert responder, a code review bot, a customer support escalation agent, a procurement pipeline, and several internal DevOps automation agents, among others. Here is what we know.
Planning Architectures: ReAct vs Plan-and-Execute
Two planning patterns dominate in practice:
ReAct (Reasoning + Acting) interleaves thought and action in a tight loop. The LLM reasons about its next step, takes an action (tool call), observes the result, reasons again, and so on. It's simple, highly responsive to new information, and works well for tasks with fewer than 10–15 steps.
# ReAct loop (simplified)
while not agent.is_done():
    thought = agent.think(current_context)    # LLM reasoning step
    action = agent.select_action(thought)     # Tool selection
    result = tool_registry.execute(action)    # Tool execution
    current_context = agent.observe(result)   # Update context
Plan-and-Execute separates planning from execution. The LLM produces a complete plan upfront (a DAG of sub-tasks), then an execution layer carries it out — possibly in parallel for independent branches. This is better for long-horizon tasks where ReAct gets lost in local optima, and it makes it easier to show users a plan for approval before execution begins.
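A minimal sketch of the execution layer, assuming a hypothetical planner that emits the DAG as a dict of sub-tasks with dependency lists, and an execute_task function that does the actual work (both names are illustrative):

from concurrent.futures import ThreadPoolExecutor

# Plan-and-Execute (simplified). The planner emits the whole DAG upfront:
# {task_id: {"action": ..., "depends_on": [task_id, ...]}}
plan = planner.create_plan(goal)   # Single upfront LLM planning call
results = {}
pending = dict(plan)

with ThreadPoolExecutor() as pool:
    while pending:
        # A task is ready once all of its dependencies have results
        batch = {tid: t for tid, t in pending.items()
                 if all(dep in results for dep in t["depends_on"])}
        if not batch:
            raise ValueError("plan contains a cycle or an unresolvable dependency")
        # Independent branches execute in parallel
        futures = {tid: pool.submit(execute_task, t, results)
                   for tid, t in batch.items()}
        for tid, future in futures.items():
            results[tid] = future.result()
            del pending[tid]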
Our rule of thumb: use ReAct for interactive, short-horizon agents (customer support, Q&A with retrieval); use Plan-and-Execute for batch tasks with defined completion criteria (document processing, data pipelines, compliance checks).
Tool Design: The Interface Is the Agent
The quality of an agent's tools matters more to its success than the choice of LLM. We've found that agents reliably fail on poorly designed tools — ambiguous descriptions, inconsistent naming, tools that do too much, or tools that return too little information on failure.
Principles we apply:
- One tool, one responsibility. A send_email_and_log_to_crm tool is two tools with a misleading name.
- Rich return types on failure. Return structured error objects with error_code, message, and suggested_action — not just null or an HTTP 500 (see the sketch after this list).
- Deterministic tools where possible. Tools that have side effects (send email, execute SQL) should be idempotent or explicitly labelled as non-idempotent with a confirmation requirement.
- Stateless tool interfaces. Pass all necessary context as parameters; don't rely on agents "remembering" earlier tool call results from outside the context window.
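As a sketch of the rich-return-types principle, a structured error might look like this (the example error code and suggested actions are illustrative):

from dataclasses import dataclass

@dataclass
class ToolError:
    """Structured failure object returned to the agent instead of null or a bare HTTP 500."""
    error_code: str        # e.g. "CRM_RATE_LIMITED"
    message: str           # Readable by both humans and the LLM
    suggested_action: str  # e.g. "retry_after_60s" or "use_alternative_tool"

# ToolError("CRM_RATE_LIMITED", "CRM API returned 429", "retry_after_60s")
# gives the agent enough context to reason about recovery instead of guessing.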
Memory Architectures
LLMs have a context window, not permanent memory. For agents that run over multiple sessions or need to learn from past interactions, you need an external memory layer. We use a three-tier model:
- Working memory: The current context window. Everything the agent knows for this invocation. Managed through careful context assembly — we summarise older turns rather than truncate them.
- Episodic memory: A vector store of past interactions, retrieved semantically. The agent can search for relevant past actions: "have I seen a similar error before? what did I do?"
- Semantic memory: A structured knowledge base (could be a RAG corpus) containing domain facts, rules, and policies that the agent should reason from.
For most production agents, episodic memory is the layer most teams neglect and most benefit from. A compliance agent that can retrieve "the last time I processed a document like this, I flagged section 4.2 for review" makes far fewer repeat mistakes.
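A sketch of what episodic retrieval looks like in practice (the vector_store.search call and the situation/action/outcome fields are our conventions, not a specific library's API):

def recall_similar_episodes(vector_store, situation: str, k: int = 3) -> list[str]:
    """Retrieve past interactions semantically similar to the current one.

    Each stored episode pairs a situation with the action taken and its
    outcome, so a hit answers: have I seen this before, and what did I do?
    """
    hits = vector_store.search(query=situation, top_k=k)
    return [
        f"Past case: {h['situation']} | Action: {h['action']} | Outcome: {h['outcome']}"
        for h in hits
    ]

# Retrieved episodes are prepended to working memory (the context window)
# before the agent's next reasoning step.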
Error Recovery: Plan for Failure
Agents encounter errors constantly — API timeouts, malformed tool responses, reasoning dead-ends. How an agent handles errors is as important as how it handles success.
We build three recovery mechanisms into every agent:
- Retry with context: On tool failure, the agent is given the error message and asked to reason about whether to retry, choose an alternative tool, or escalate.
- Max step limit: Every agent has a hard cap on total actions per invocation (typically 25–50). Agents that loop are expensive and dangerous.
- Graceful degradation: When an agent cannot complete a task with confidence, it produces a partial result with explicit uncertainty markers rather than a hallucinated confident answer.
MAX_STEPS = 30

def run_with_step_limit(agent, context):
    for step in range(MAX_STEPS):
        result = agent.step(context)
        if result.status == "complete":
            return result.output
        if result.status == "needs_human":
            return escalate_to_human(result.partial_output, result.reason)
        if result.status == "tool_error":
            # Agent will reason about the error on its next step
            context = context.append_error(result.error)
    # Exceeded max steps — escalate rather than guess
    return escalate_to_human(context.partial_output, "max_steps_exceeded")
Human-in-the-Loop Checkpoints
The most important lesson from our production deployments: not all decisions should be fully autonomous. The engineering goal is not to remove humans entirely but to remove humans from decisions that are routine, low-stakes, and high-volume — and keep humans in the loop for decisions that are irreversible, high-stakes, or outside the agent's confidence envelope.
We classify agent actions into three tiers:
- Tier 1 — Fully autonomous: Read-only actions, low-stakes writes (draft creation, internal logging). No checkpoint required.
- Tier 2 — Supervised: External communications, financial transactions, configuration changes. Agent proposes action; human approves in a lightweight UI (a Slack message with approve/reject buttons).
- Tier 3 — Human-led: Legal filings, regulatory submissions, anything irreversible at scale. Agent provides analysis and a recommended action; human executes.
This classification doesn't limit agent value — it focuses it. Our compliance agent handles 300+ document reviews per day autonomously (Tier 1) while routing the 8–12 genuinely ambiguous cases to a human reviewer (Tier 2). That's a 96% reduction in human review volume with zero reduction in accuracy on the edge cases.
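A minimal sketch of tier-based routing (the Tier enum, the TIER_BY_TOOL mapping, and the request_slack_approval helper are illustrative names, not a framework API):

from enum import Enum

class Tier(Enum):
    AUTONOMOUS = 1   # Read-only actions, low-stakes writes
    SUPERVISED = 2   # Agent proposes; human approves before execution
    HUMAN_LED = 3    # Agent recommends; a human executes

# Illustrative mapping; real deployments derive this from tool metadata
TIER_BY_TOOL = {
    "search_documents": Tier.AUTONOMOUS,
    "send_customer_email": Tier.SUPERVISED,
    "file_regulatory_report": Tier.HUMAN_LED,
}

def dispatch(action):
    tier = TIER_BY_TOOL[action.tool_name]
    if tier is Tier.AUTONOMOUS:
        return tool_registry.execute(action)
    if tier is Tier.SUPERVISED:
        # Blocks on a lightweight approval UI (e.g. Slack buttons)
        if request_slack_approval(action):
            return tool_registry.execute(action)
        return escalate_to_human(action, "approval_rejected")
    # Tier 3: the agent never executes; it only hands over its analysis
    return escalate_to_human(action, "tier_3_requires_human_execution")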
Evaluation Frameworks for Agents
Testing agents is qualitatively harder than testing deterministic software. You can't just assert on an output — you need to evaluate trajectories (was the sequence of steps reasonable?) and outcomes (did the agent accomplish the stated goal?).
We use a combination of:
- Trajectory evaluation: LLM-as-judge scoring each step in the agent's reasoning chain for relevance, accuracy, and efficiency. We use Claude Opus 4 as the judge for this.
- End-to-end task success rate: Against a golden set of 50–100 tasks with known correct outcomes. Track this over time; regressions require root-cause analysis before deployment.
- Tool call accuracy: Did the agent call the right tools with the right parameters? Measured separately from whether the tools returned useful results.
- Hallucination rate: For agents that synthesise information from retrieved context, we run RAGAS faithfulness scores on a sample of outputs daily.
Target end-to-end task success rates by tier: Tier 1 tasks ≥ 95%, Tier 2 tasks ≥ 85% (with human confirmation catching the remainder), overall hallucination rate < 2%.
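As a sketch of the golden-set harness from the list above (the task format with a check verifier and the run_agent entry point are our assumptions):

def task_success_rate(agent, golden_tasks) -> float:
    """End-to-end success rate against tasks with known correct outcomes.

    Each golden task carries an input and a check(output) -> bool verifier.
    The rate is tracked per release; any regression blocks deployment
    until it is root-caused.
    """
    passed = sum(
        1 for task in golden_tasks
        if task["check"](run_agent(agent, task["input"]))
    )
    return passed / len(golden_tasks)

# Deployment gate (thresholds from the targets above):
#   Tier 1: task_success_rate(agent, golden_set) >= 0.95
#   Tier 2: task_success_rate(agent, golden_set) >= 0.85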
Common Failure Modes We've Hit
- Tool result hallucination: Agent claims a tool returned something it didn't. Fixed by strict output parsing and schema validation on all tool return values (sketched after this list).
- Context window exhaustion: Agent loses its place in long tasks. Fixed by periodic context summarisation and session checkpointing.
- Goal drift: Agent successfully completes a sub-task and then generates unnecessary additional steps. Fixed by explicit "done" conditions in the system prompt and max-step limits.
- Cascading failures: One failed tool call causes the agent to spiral into increasingly confused reasoning. Fixed by structured error context and a "confused state" detector that escalates rather than continuing.
- Prompt injection via tool output: Malicious content in a tool's return value attempts to override agent instructions. Fixed by sandboxed tool output formatting and explicit trust boundaries in the system prompt.
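As a sketch of the schema-validation fix for tool result hallucination, using pydantic (the EmailSendResult model is illustrative):

from pydantic import BaseModel, ValidationError

class EmailSendResult(BaseModel):
    """Expected shape of the send_email tool's return value."""
    message_id: str
    delivered: bool

def validate_tool_result(raw: dict) -> str:
    """Only schema-valid output reaches the agent's context, so the agent
    cannot claim to have observed fields the tool never returned."""
    try:
        return EmailSendResult.model_validate(raw).model_dump_json()
    except ValidationError as exc:
        # Surface a structured error instead of letting the agent guess
        return f'{{"error_code": "SCHEMA_VIOLATION", "message": "{exc.error_count()} invalid field(s)"}}'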
Key Takeaways
- ReAct for short-horizon interactive tasks; Plan-and-Execute for long-horizon batch tasks.
- Tool quality matters more than model choice. Invest in rich error return types and single-responsibility interfaces.
- Build three-tier memory (working, episodic, semantic) — episodic memory has the highest ROI for production agents.
- Hard max-step limits and structured error recovery are not optional. Agents that loop are expensive and dangerous.
- Human-in-the-loop is a feature, not a failure. Classify decisions by reversibility and stake; automate Tier 1, supervise Tier 2.
- Measure trajectory quality, task success rate, and hallucination rate — all three independently.