
Building Reliable AI Agents: Lessons From Deploying 12 Autonomous Systems

AI agents that work in demos routinely fail in production. The failure modes are qualitatively different from those of traditional software — they're probabilistic, context-sensitive, and often invisible until they've caused real damage. This post documents what we've learned deploying 12 autonomous systems across FinTech, legal, and internal engineering use cases.

Codecanis Admin

10 min read
One of twelve agentic systems in production — this one resolves 78% of tier-1 support tickets without human escalation.

An AI agent is, at minimum, an LLM that can use tools and take actions in the world — send emails, query databases, call APIs, write code, execute commands. That description makes agents sound straightforward. The reality is that every capability you give an agent is also a new failure mode.

We've deployed 12 agentic systems in production: a compliance document analyst, a trading alert responder, a code review bot, a customer support escalation agent, a procurement pipeline, and several internal DevOps automation agents, among others. Here is what we've learned.

Planning Architectures: ReAct vs Plan-and-Execute

Two planning patterns dominate in practice:

ReAct (Reasoning + Acting) interleaves thought and action in a tight loop. The LLM reasons about its next step, takes an action (tool call), observes the result, reasons again, and so on. It's simple, highly responsive to new information, and works well for tasks with fewer than 10–15 steps.

# ReAct loop (simplified)
while not agent.is_done():
    thought = agent.think(current_context)       # LLM reasoning step
    action  = agent.select_action(thought)        # Tool selection
    result  = tool_registry.execute(action)       # Tool execution
    current_context = agent.observe(result)       # Update context

Plan-and-Execute separates planning from execution. The LLM produces a complete plan upfront (a DAG of sub-tasks), then an execution layer carries it out — possibly in parallel for independent branches. This is better for long-horizon tasks where ReAct gets lost in local optima, and it makes it easier to show users a plan for approval before execution begins.
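A minimal sketch of the execution side of Plan-and-Execute. The `Task` dataclass and `run_task` callable are hypothetical names for illustration; the plan's DAG is represented as a flat list with `depends_on` edges, and each `ready` batch is where a real executor could fan out in parallel:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    action: str
    depends_on: list = field(default_factory=list)

def execute_plan(plan, run_task):
    """Run tasks in dependency order; each `ready` batch is parallelisable."""
    done, results = set(), {}
    while len(done) < len(plan):
        ready = [t for t in plan if t.id not in done
                 and all(d in done for d in t.depends_on)]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency in plan")
        for task in ready:  # independent tasks: safe to run concurrently
            results[task.id] = run_task(task)
            done.add(task.id)
    return results
```

Because the full plan exists before execution starts, this is also the natural point to render it for user approval.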

Our rule of thumb: use ReAct for interactive, short-horizon agents (customer support, Q&A with retrieval); use Plan-and-Execute for batch tasks with defined completion criteria (document processing, data pipelines, compliance checks).

Tool Design: The Interface Is the Agent

The quality of an agent's tools does more to determine its success than the choice of LLM. We've found that agents reliably fail on poorly designed tools: ambiguous descriptions, inconsistent naming, tools that do too much, or tools that return too little information on failure.

Principles we apply:

  • One tool, one responsibility. A send_email_and_log_to_crm tool is two tools with a misleading name.
  • Rich return types on failure. Return structured error objects with error_code, message, and suggested_action — not just null or an HTTP 500.
  • Deterministic tools where possible. Tools that have side effects (send email, execute SQL) should be idempotent or explicitly labelled as non-idempotent with a confirmation requirement.
  • Stateless tool interfaces. Pass all necessary context as parameters; don't rely on agents "remembering" earlier tool call results from outside the context window.
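The "rich return types on failure" principle can be sketched as follows. `ToolResult`, `send_email`, and the error code here are illustrative names, not a real API; the point is that the agent receives a machine-readable reason and a next step, never a bare null:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolResult:
    ok: bool
    data: Optional[dict] = None
    error_code: Optional[str] = None
    message: Optional[str] = None
    suggested_action: Optional[str] = None

def send_email(to: str, subject: str, body: str) -> ToolResult:
    if "@" not in to:
        # Structured failure: the agent can reason about what to do next
        return ToolResult(
            ok=False,
            error_code="INVALID_RECIPIENT",
            message=f"'{to}' is not a valid email address",
            suggested_action="Ask the user to confirm the recipient address",
        )
    # ... actual delivery omitted in this sketch ...
    return ToolResult(ok=True, data={"message_id": "stub"})
```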

Memory Architectures

LLMs have a context window, not permanent memory. For agents that run over multiple sessions or need to learn from past interactions, you need an external memory layer. We use a three-tier model:

  1. Working memory: The current context window. Everything the agent knows for this invocation. Managed through careful context assembly — we summarise older turns rather than truncate them.
  2. Episodic memory: A vector store of past interactions, retrieved semantically. The agent can search for relevant past actions: "have I seen a similar error before? what did I do?"
  3. Semantic memory: A structured knowledge base (could be a RAG corpus) containing domain facts, rules, and policies that the agent should reason from.

For most production agents, episodic memory is the layer most teams neglect and most benefit from. A compliance agent that can retrieve "the last time I processed a document like this, I flagged section 4.2 for review" makes far fewer redundant mistakes.
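A toy illustration of the episodic tier. Bag-of-words cosine similarity stands in for real embeddings, and `EpisodicMemory` is a hypothetical name; a production system would use an embedding model and a vector database, but the record/recall shape is the same:

```python
import math
from collections import Counter

class EpisodicMemory:
    """Toy episodic store: cosine over word counts approximates
    the semantic retrieval a real vector store would provide."""

    def __init__(self):
        self.episodes = []  # list of (text, metadata) pairs

    def _vec(self, text):
        return Counter(text.lower().split())

    def _cosine(self, a, b):
        dot = sum(a[k] * b[k] for k in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def record(self, text, metadata=None):
        self.episodes.append((text, metadata or {}))

    def recall(self, query, k=3):
        qv = self._vec(query)
        return sorted(self.episodes,
                      key=lambda ep: self._cosine(qv, self._vec(ep[0])),
                      reverse=True)[:k]
```

At invocation time, the top-k recalled episodes are injected into working memory alongside the current task.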

Error Recovery: Plan for Failure

Agents encounter errors constantly — API timeouts, malformed tool responses, reasoning dead-ends. How an agent handles errors is as important as how it handles success.

We build three recovery mechanisms into every agent:

  • Retry with context: On tool failure, the agent is given the error message and asked to reason about whether to retry, choose an alternative tool, or escalate.
  • Max step limit: Every agent has a hard cap on total actions per invocation (typically 25–50). Agents that loop are expensive and dangerous.
  • Graceful degradation: When an agent cannot complete a task with confidence, it produces a partial result with explicit uncertainty markers rather than a hallucinated confident answer.

MAX_STEPS = 30

def run_agent(agent, context):
    for _ in range(MAX_STEPS):
        result = agent.step(context)

        if result.status == "complete":
            return result.output

        if result.status == "needs_human":
            return escalate_to_human(result.partial_output, result.reason)

        if result.status == "tool_error":
            # Feed the error back; the agent reasons about it on the next step
            context = context.append_error(result.error)

    # Exceeded max steps — escalate rather than guess
    return escalate_to_human(context.partial_output, "max_steps_exceeded")

Human-in-the-Loop Checkpoints

The most important lesson from our production deployments: not all decisions should be fully autonomous. The engineering goal is not to remove humans entirely but to remove humans from decisions that are routine, low-stakes, and high-volume — and keep humans in the loop for decisions that are irreversible, high-stakes, or outside the agent's confidence envelope.

We classify agent actions into three tiers:

  • Tier 1 — Fully autonomous: Read-only actions, low-stakes writes (draft creation, internal logging). No checkpoint required.
  • Tier 2 — Supervised: External communications, financial transactions, configuration changes. Agent proposes action; human approves in a lightweight UI (a Slack message with approve/reject buttons).
  • Tier 3 — Human-led: Legal filings, regulatory submissions, anything irreversible at scale. Agent provides analysis and a recommended action; human executes.
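The tiering reduces to a dispatch table in code. The action names, `Tier` enum, and callbacks below are hypothetical; note that unknown actions default to Tier 2, which is the conservative choice:

```python
from enum import Enum

class Tier(Enum):
    AUTONOMOUS = 1
    SUPERVISED = 2
    HUMAN_LED = 3

# Illustrative mapping; in practice this lives in per-deployment config.
ACTION_TIERS = {
    "read_document": Tier.AUTONOMOUS,
    "create_draft": Tier.AUTONOMOUS,
    "send_external_email": Tier.SUPERVISED,
    "submit_regulatory_filing": Tier.HUMAN_LED,
}

def dispatch(action, execute, request_approval, hand_to_human):
    # Unknown actions fall through to SUPERVISED rather than AUTONOMOUS
    tier = ACTION_TIERS.get(action, Tier.SUPERVISED)
    if tier is Tier.AUTONOMOUS:
        return execute(action)
    if tier is Tier.SUPERVISED:
        return execute(action) if request_approval(action) else "rejected"
    return hand_to_human(action)
```

In our deployments, `request_approval` is the Slack approve/reject flow described above.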

This classification doesn't limit agent value — it focuses it. Our compliance agent handles 300+ document reviews per day autonomously (Tier 1) while routing the 8–12 genuinely ambiguous cases to a human reviewer (Tier 2). That's a 96% reduction in human review volume with zero reduction in accuracy on the edge cases.

Evaluation Frameworks for Agents

Testing agents is qualitatively harder than testing deterministic software. You can't just assert on an output — you need to evaluate trajectories (was the sequence of steps reasonable?) and outcomes (did the agent accomplish the stated goal?).

We use a combination of:

  • Trajectory evaluation: LLM-as-judge scoring each step in the agent's reasoning chain for relevance, accuracy, and efficiency. We use Claude Opus 4 as the judge for this.
  • End-to-end task success rate: Against a golden set of 50–100 tasks with known correct outcomes. Track this over time; regressions require root-cause analysis before deployment.
  • Tool call accuracy: Did the agent call the right tools with the right parameters? Measured separately from whether the tools returned useful results.
  • Hallucination rate: For agents that synthesise information from retrieved context, we run RAGAS faithfulness scores on a sample of outputs daily.

Target end-to-end task success rates by tier: Tier 1 tasks ≥ 95%, Tier 2 tasks ≥ 85% (with human confirmation catching the remainder), overall hallucination rate < 2%.
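A minimal harness for the end-to-end metric and the tier targets above. `run_agent` and `check` are caller-supplied stand-ins for the real agent invocation and outcome check, not part of any library:

```python
def task_success_rate(golden_set, run_agent, check):
    """golden_set: list of (task, expected); check(output, expected) -> bool."""
    passed = sum(1 for task, expected in golden_set
                 if check(run_agent(task), expected))
    return passed / len(golden_set)

# Targets from the text: Tier 1 >= 95%, Tier 2 >= 85%
TIER_TARGETS = {1: 0.95, 2: 0.85}

def gate_deployment(rate, tier):
    """A regression below target blocks deployment pending root-cause analysis."""
    return rate >= TIER_TARGETS[tier]
```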

Common Failure Modes We've Hit

  • Tool result hallucination: Agent claims a tool returned something it didn't. Fixed by strict output parsing and schema validation on all tool return values.
  • Context window exhaustion: Agent loses its place in long tasks. Fixed by periodic context summarisation and session checkpointing.
  • Goal drift: Agent successfully completes a sub-task and then generates unnecessary additional steps. Fixed by explicit "done" conditions in the system prompt and max-step limits.
  • Cascading failures: One failed tool call causes the agent to spiral into increasingly confused reasoning. Fixed by structured error context and a "confused state" detector that escalates rather than continuing.
  • Prompt injection via tool output: Malicious content in a tool's return value attempts to override agent instructions. Fixed by sandboxed tool output formatting and explicit trust boundaries in the system prompt.
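The schema-validation fix for tool result hallucination can be sketched as below. This is a hand-rolled check for illustration (a library such as Pydantic would do this in practice); the key property is that the agent never sees fields a tool didn't actually return:

```python
def validate_tool_output(raw: dict, schema: dict) -> dict:
    """Check a tool's raw return value against an expected field->type schema.

    Raises ValueError on missing fields or type mismatches; silently drops
    fields the schema doesn't declare, so nothing unvetted reaches the agent.
    """
    validated = {}
    for field, expected_type in schema.items():
        if field not in raw:
            raise ValueError(f"tool output missing required field '{field}'")
        if not isinstance(raw[field], expected_type):
            raise ValueError(
                f"field '{field}' has type {type(raw[field]).__name__}, "
                f"expected {expected_type.__name__}")
        validated[field] = raw[field]
    return validated
```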

Key Takeaways

  • ReAct for short-horizon interactive tasks; Plan-and-Execute for long-horizon batch tasks.
  • Tool quality matters more than model choice. Invest in rich error return types and single-responsibility interfaces.
  • Build three-tier memory (working, episodic, semantic) — episodic memory has the highest ROI for production agents.
  • Hard max-step limits and structured error recovery are not optional. Agents that loop are expensive and dangerous.
  • Human-in-the-loop is a feature, not a failure. Classify decisions by reversibility and stake; automate Tier 1, supervise Tier 2.
  • Measure trajectory quality, task success rate, and hallucination rate — all three independently.

Want to work together?

If this article made you think about your architecture, your roadmap, or a problem you haven't solved yet — let's talk.