Building AI Agents with Tool Use: From Concept to Production

TL;DR: AI agents fail in production because of scaffolding, not models. This post covers the ReAct pattern, tool design principles, multi-agent orchestration, idempotency, max-iteration guards, and observability: the things I learned the hard way across three agents before one finally shipped cleanly.

I've built three AI agents that made it to production. Two of them got rewritten before launch because they broke in ways I didn't anticipate. The third one works well, and the difference wasn't the model — it was the scaffolding I built around it.

An LLM that only generates text is a very expensive search engine. An LLM with tools — the ability to call APIs, read files, run code, take real actions — is qualitatively different. It's an agent. And building an agent that actually works in production means solving a different set of problems than building an LLM integration.

What an agent actually is

An AI agent is an LLM in a loop that can:

Observe its environment (tools available, context, previous outputs)
Reason about what action to take next
Act by calling tools
Observe the result and iterate

The key shift: the model decides the control flow. You don't orchestrate step-by-step. You give the model a goal, a set of tools, and a loop — and it figures out the sequence.

┌─────────────────────────────────────────┐
│              Agent Loop                  │
│                                          │
│  [User Goal] → LLM → [Tool Call?]       │
│                  ↑          ↓           │
│              [Result] ← [Execute Tool]  │
│                  ↑                      │
│              [Final Answer if done]     │
└─────────────────────────────────────────┘

This is elegant in theory. In practice it's a loop that can go sideways in about fifteen different ways, which is why you need guardrails from the start.

Defining tools

Modern LLMs expose tool use as a first-class capability. You describe each tool with a name, a description, and a JSON Schema for its input, and the model decides when and how to call it:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const tools: Anthropic.Tool[] = [
  {
    name: "get_weather",
    description:
      "Get the current weather for a city. Returns temperature and conditions.",
    input_schema: {
      type: "object",
      properties: {
        city: {
          type: "string",
          description: "City name, e.g. 'San Francisco, CA'",
        },
        unit: {
          type: "string",
          enum: ["celsius", "fahrenheit"],
        },
      },
      required: ["city"],
    },
  },
  {
    name: "search_web",
    description: "Search the web and return the top 5 results with summaries.",
    input_schema: {
      type: "object",
      properties: {
        query: { type: "string" },
      },
      required: ["query"],
    },
  },
];
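Because the model writes the tool arguments, it can help to check them against the schema before executing anything. The following is a minimal sketch, not a full JSON Schema validator: it assumes flat schemas like the two above (string properties, an optional enum, a required list), and `SimpleSchema` and `validateToolInput` are hypothetical names introduced for illustration.

```typescript
// Hypothetical, simplified schema shape: a flat object whose properties are all strings.
interface SimpleSchema {
  type: "object";
  properties: Record<string, { type: "string"; enum?: string[] }>;
  required?: string[];
}

// Returns a list of problems; an empty list means the input passed these checks.
function validateToolInput(schema: SimpleSchema, input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["input must be a JSON object"];
  }
  const record = input as Record<string, unknown>;
  const problems: string[] = [];

  for (const key of schema.required ?? []) {
    if (record[key] === undefined) problems.push(`missing required field "${key}"`);
  }

  for (const key of Object.keys(schema.properties)) {
    const value = record[key];
    if (value === undefined) continue; // optional field not supplied
    const allowed = schema.properties[key]?.enum;
    if (typeof value !== "string") {
      problems.push(`field "${key}" must be a string`);
    } else if (allowed !== undefined && !allowed.includes(value)) {
      problems.push(`field "${key}" must be one of: ${allowed.join(", ")}`);
    }
  }
  return problems;
}

// Same shape as the get_weather input_schema above.
const weatherSchema: SimpleSchema = {
  type: "object",
  properties: {
    city: { type: "string" },
    unit: { type: "string", enum: ["celsius", "fahrenheit"] },
  },
  required: ["city"],
};

console.log(validateToolInput(weatherSchema, { city: "Paris", unit: "kelvin" }));
```

Returning the list of problems to the model as the tool result, rather than throwing, gives it a chance to correct the call on the next turn.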

The agentic loop

async function runAgent(userMessage: string): Promise<string> {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: userMessage },
  ];

  while (true) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      tools,
      messages,
    });

    if (response.stop_reason === "tool_use") {
      // Keep the assistant turn (including its tool_use blocks) in the history.
      messages.push({ role: "assistant", content: response.content });

      const toolResults: Anthropic.ToolResultBlockParam[] = [];

      for (const block of response.content) {
        if (block.type !== "tool_use") continue;

        const result = await executeTool(block.name, block.input);
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: JSON.stringify(result),
        });
      }

      // Tool results go back to the model as a user turn.
      messages.push({ role: "user", content: toolResults });
      continue;
    }

    // Any other stop reason ("end_turn", "max_tokens", ...) ends the run.
    // Otherwise an unhandled stop reason would resend the same request forever.
    const textBlock = response.content.find(
      (b): b is Anthropic.TextBlock => b.type === "text"
    );
    return textBlock?.text ?? "";
  }
}

The while(true) loop is intentional — the model drives the termination condition. But you absolutely need to add a max-iteration guard (more on that below).

The ReAct pattern

ReAct (Reason + Act) is the pattern that separates agents you can debug from agents that are mystery boxes.

Before each tool call, the model reasons about what it's doing and why. After each result, it reasons about what it learned. This happens in the output tokens before the tool call JSON:

Thought: The user wants to compare Q1 revenue for Stripe and Shopify.
         I should search for each company's earnings separately.

Action: search_web("Stripe Q1 2025 revenue earnings")
Observation: [search results]

Thought: I have Stripe's data. Now I need Shopify's.

Action: search_web("Shopify Q1 2025 revenue earnings")
Observation: [search results]

Thought: I have both data points. Now I can compare.
Answer: ...

Enforce it in your system prompt:

Before each tool call, write a brief "Thought:" explaining your reasoning.
After observing results, write another "Thought:" before deciding next steps.

Two practical benefits: the model makes fewer unnecessary tool calls (the reasoning step often catches "actually I already have this"), and when something goes wrong, you can read the reasoning trace and understand exactly where the agent went off track. That's enormously valuable when debugging production failures at 2am.
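To make that reasoning trace easy to read later, one option is to pull the "Thought:" lines out of each response and pair them with the tool call that followed. This is only a sketch: `TraceBlock` is a simplified stand-in for the text and tool_use content blocks in the loop above, `extractReasoning` is a hypothetical helper, and it captures only lines that start with "Thought:" (continuation lines are ignored).

```typescript
// Simplified stand-ins for the content blocks of a model response (assumed shapes).
type TraceBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

interface ReasoningStep {
  thought: string; // the "Thought:" lines written before the action
  tool: string | null; // tool called after the thought, or null for a final answer
}

function extractReasoning(blocks: TraceBlock[]): ReasoningStep[] {
  const steps: ReasoningStep[] = [];
  let pending: string[] = [];

  for (const block of blocks) {
    if (block.type === "text") {
      for (const line of block.text.split("\n")) {
        const match = /^\s*Thought:\s*(.*)$/.exec(line);
        if (match) pending.push((match[1] ?? "").trim());
      }
    } else {
      // A tool call closes the current step: attach the thoughts seen so far.
      steps.push({ thought: pending.join(" "), tool: block.name });
      pending = [];
    }
  }

  // Thoughts with no tool call after them belong to the final answer.
  if (pending.length > 0) steps.push({ thought: pending.join(" "), tool: null });
  return steps;
}
```

Storing these steps next to the tool inputs and outputs means a failed run can be read top to bottom without replaying it.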

Tool design is more important than model choice

This took me a while to internalize. I've swapped models and gotten marginal improvements. I've redesigned tools and gotten dramatic improvements.

One tool, one responsibility:

❌  get_and_process_customer_data(customer_id)
✅  get_customer(customer_id)
✅  get_customer_orders(customer_id)
✅  update_customer_status(customer_id, status)

Narrow tools are easier for the model to use correctly. They're also easier to test in isolation and easier to add logging to.

Descriptions are contracts — write them for a very literal reader:

❌  "Gets customer info"
✅  "Retrieve a customer record by ID. Returns: { id, name, email, plan, created_at }.
    Returns a CUSTOMER_NOT_FOUND error if the ID doesn't exist. Use get_customer_orders
    to retrieve their order history separately."

The model reads your description, not your code. If the description is ambiguous, the model will guess — and guesses in agentic loops compound.

Return structured data, not sentences:

❌  "The customer was created successfully on January 15th, 2025"
✅  { "success": true, "customer_id": "cus_abc123", "created_at": "2025-01-15T10:23:00Z" }

Make errors actionable:

type ToolError = { error: string; message: string };

async function getCustomer(id: string): Promise<Customer | ToolError> {
  const customer = await db.customers.findById(id);

  if (!customer) {
    // Return the failure as data so the model can read it and pick a next step.
    return {
      error: "CUSTOMER_NOT_FOUND",
      message:
        `No customer found with id "${id}". Verify the ID and try again. ` +
        `To search by email, use the search_customers tool instead.`,
    };
  }

  return customer;
}

An agent that receives a bare "Customer not found" tends to retry the same call. An agent that receives the message above has enough information to pivot to a different approach. The error message is part of the tool's contract with the model.
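Tools also fail in ways their authors did not plan for, such as timeouts or bugs. One way to keep the contract intact is to convert anything a tool throws into the same structured shape before it reaches the model. The sketch below assumes a `hint` field and the names `toToolError` and `callSafely`, none of which come from the original code.

```typescript
interface StructuredToolError {
  error: string; // stable, machine-readable code
  message: string; // what went wrong
  hint: string; // what the model could try next
}

// Normalize any thrown value into a structured error the model can act on.
function toToolError(thrown: unknown, hint: string): StructuredToolError {
  const message = thrown instanceof Error ? thrown.message : String(thrown);
  return { error: "TOOL_FAILED", message, hint };
}

// Run a tool implementation and return either its result or a structured error.
async function callSafely<T>(
  run: () => Promise<T>,
  hint: string,
): Promise<T | StructuredToolError> {
  try {
    return await run();
  } catch (thrown) {
    return toToolError(thrown, hint);
  }
}
```

With a wrapper like this around every tool, the loop never has to decide what to do with an exception: the model always receives something it can read.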

Multi-agent architectures

Single agents break down on complex, multi-domain tasks. Not because the model isn't smart enough — because the context window fills up and the agent loses track of what it was doing.

The orchestrator/subagent pattern solves this:

┌─────────────────────────────────────────────┐
│              Orchestrator Agent              │
│   (plans, delegates, synthesizes results)   │
└──────────┬──────────┬──────────┬────────────┘
           │          │          │
    ┌──────▼──┐  ┌────▼───┐  ┌──▼──────┐
    │ Research │  │ Analyst │  │  Writer │
    │  Agent   │  │  Agent  │  │  Agent  │
    └──────────┘  └────────┘  └─────────┘

Each subagent gets a fresh context window and a narrow scope. The orchestrator handles delegation and synthesis. When subagents are independent, run them in parallel:

const subtasks = [
  { agent: "researcher", task: "Find Q1 2025 Stripe revenue" },
  { agent: "researcher", task: "Find Q1 2025 Shopify revenue" },
  { agent: "data_analyst", task: "Get historical revenue trend data" },
];

const results = await Promise.all(
  subtasks.map((task) => runSubagent(task.agent, task.task))
);
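One caveat: Promise.all rejects as soon as any subagent throws, which discards the work of the ones that succeeded. A possible alternative, sketched here, settles each subtask individually so the orchestrator sees every outcome (Promise.allSettled would achieve much the same). The `SubagentRunner` signature is an assumption about what `runSubagent` looks like, and `runAllSubtasks` is a hypothetical name.

```typescript
interface Subtask {
  agent: string;
  task: string;
}

interface SubtaskOutcome {
  agent: string;
  task: string;
  ok: boolean;
  output: string; // the subagent's answer, or the failure reason when ok is false
}

// Stand-in for runSubagent; the real signature may differ.
type SubagentRunner = (agent: string, task: string) => Promise<string>;

async function runAllSubtasks(
  subtasks: Subtask[],
  run: SubagentRunner,
): Promise<SubtaskOutcome[]> {
  return Promise.all(
    subtasks.map(async ({ agent, task }): Promise<SubtaskOutcome> => {
      try {
        return { agent, task, ok: true, output: await run(agent, task) };
      } catch (thrown) {
        // A failed subagent becomes data instead of rejecting the whole batch.
        const reason = thrown instanceof Error ? thrown.message : String(thrown);
        return { agent, task, ok: false, output: reason };
      }
    }),
  );
}
```

The orchestrator can then decide per subtask whether to retry, proceed with partial results, or report the gap.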

Safety and reliability — design for failure from the start

I can't stress this enough: production agents need safety guardrails designed in from day one, not bolted on after something breaks.

Classify tools by risk level:

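type RiskLevel = "safe" | "medium" | "high";

const TOOL_RISK: Record<RiskLevel, string[]> = {
  safe: ["search_web", "get_customer", "get_weather"],
  medium: ["create_ticket", "send_email_draft", "update_record"],
  high: ["delete_record", "send_bulk_email", "execute_sql"],
};

// Assumption: tools missing from TOOL_RISK default to "high", so a newly added
// tool cannot skip approval by accident.
function getRiskLevel(name: string): RiskLevel {
  if (TOOL_RISK.safe.includes(name)) return "safe";
  if (TOOL_RISK.medium.includes(name)) return "medium";
  return "high";
}

async function executeTool(name: string, input: unknown) {
  // The model can name a tool that doesn't exist; report that instead of crashing.
  const implementation = toolImplementations[name];
  if (!implementation) {
    return { error: "UNKNOWN_TOOL", message: `No tool named "${name}" is registered.` };
  }

  if (getRiskLevel(name) === "high") {
    const approved = await requestHumanApproval({ tool: name, input });
    if (!approved) {
      return { error: "APPROVAL_DENIED", message: "User did not approve this action." };
    }
  }

  return implementation(input);
}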

Max iteration guard — non-negotiable:

const MAX_ITERATIONS = 15;
let iterations = 0;

while (true) {
  if (++iterations > MAX_ITERATIONS) {
    return {
      error: "MAX_ITERATIONS_EXCEEDED",
      partial_result: getPartialResult(messages),
    };
  }

  // ...call the model and execute tools here, as in runAgent...
}

Agents can get stuck in retry loops. Without a ceiling, they'll burn your API budget and return nothing useful.
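For a version of the guard that can be exercised without a model, the loop can be written as a small generic helper. This is a sketch under assumptions: `runBounded`, `StepResult`, and `LoopOutcome` are hypothetical names, and `step` stands in for one round of "call the model, execute any tools".

```typescript
// One round of the agent loop either finishes with a value or asks to continue.
type StepResult<T> = { done: true; value: T } | { done: false };

type LoopOutcome<T> =
  | { status: "completed"; value: T; iterations: number }
  | { status: "max_iterations_exceeded"; iterations: number };

async function runBounded<T>(
  step: (iteration: number) => Promise<StepResult<T>>,
  maxIterations: number,
): Promise<LoopOutcome<T>> {
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    const result = await step(iteration);
    if (result.done) {
      return { status: "completed", value: result.value, iterations: iteration };
    }
  }
  // The ceiling was reached without a final answer; the caller decides what to report.
  return { status: "max_iterations_exceeded", iterations: maxIterations };
}
```

Returning a tagged outcome instead of throwing forces the caller to handle the "ran out of iterations" case explicitly.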

Make tools idempotent:

// ❌ Creates a duplicate every retry
async function createOrder(items: Item[]) { ... }

// ✅ Safe to retry: returns the existing record if this key was already used.
// A unique index on idempotencyKey is still needed to cover concurrent retries.
async function createOrder(items: Item[], idempotencyKey: string) {
  const existing = await db.orders.findByKey(idempotencyKey);
  if (existing) return existing;
  return db.orders.create({ items, idempotencyKey });
}
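That leaves the question of where the idempotency key comes from when the caller is an agent. One option, which is an assumption of this sketch rather than something from the code above, is to derive it from the session, the tool name, and a canonical form of the input, so that a retried identical call maps to the same key. The in-memory Map stands in for the database, and `stableStringify`, `idempotencyKey`, and `runOnce` are hypothetical helpers.

```typescript
// Serialize with object keys sorted, so logically equal inputs give the same string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (typeof value === "object" && value !== null) {
    const record = value as Record<string, unknown>;
    const parts = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(record[key])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

function idempotencyKey(sessionId: string, tool: string, input: unknown): string {
  return `${sessionId}:${tool}:${stableStringify(input)}`;
}

// In-memory stand-in for a table with a unique index on the key.
const startedActions = new Map<string, Promise<unknown>>();

// Runs the action at most once per key; repeat calls share the first call's result.
function runOnce<T>(key: string, action: () => Promise<T>): Promise<T> {
  const existing = startedActions.get(key);
  if (existing !== undefined) return existing as Promise<T>;

  const started = action();
  startedActions.set(key, started);
  // A failed attempt is forgotten so that a later retry can run the action again.
  started.catch(() => startedActions.delete(key));
  return started;
}
```

Scoping the key to the session is a design choice: it deduplicates retries within one run while still allowing the same order to be placed deliberately in a later session.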

MCP for production tool integration

The Model Context Protocol standardizes how tools integrate across models and hosts. Instead of writing bespoke tool implementations per model, you expose tools through an MCP server, and any MCP-compatible host can use them:

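import { Server } from "@modelcontextprotocol/sdk/server/index.js";
import { ListToolsRequestSchema } from "@modelcontextprotocol/sdk/types.js";

const server = new Server(
  { name: "my-tools", version: "1.0.0" },
  { capabilities: { tools: {} } } // declare that this server exposes tools
);

// Handlers are registered against request schemas rather than method-name strings.
server.setRequestHandler(ListToolsRequestSchema, async () => ({
  tools: [
    {
      name: "query_database",
      description: "Run a read-only SQL query against the analytics database",
      inputSchema: {
        type: "object",
        properties: { sql: { type: "string" } },
        required: ["sql"],
      },
    },
  ],
}));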

Once your tools are MCP servers, they work with Claude Code, Claude Desktop, and any other MCP host — write once, use everywhere.

Observability — log everything

Production agents are opaque by default. You need to make them transparent:

interface AgentTrace {
  session_id: string;
  user_query: string;
  iterations: number;
  tool_calls: {
    tool: string;
    input: unknown;
    output: unknown;
    duration_ms: number;
    error?: string;
  }[];
  final_response: string;
  total_tokens: number;
  duration_ms: number;
}

Structured traces let you debug failures, catch loop patterns before users encounter them, and attribute costs to specific features. When an agent does something unexpected in production, the trace is how you find out why.
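As a rough illustration of how the tool_calls entries might get populated, the sketch below wraps each tool execution so that timing and errors are recorded automatically, and adds a simple check for one kind of loop pattern: the same tool called with identical input several times in a row. `ToolCallRecord` mirrors the tool_calls entry above; `tracedCall` and `longestRepeatRun` are hypothetical helpers.

```typescript
// Same fields as one entry of AgentTrace.tool_calls.
interface ToolCallRecord {
  tool: string;
  input: unknown;
  output: unknown;
  duration_ms: number;
  error?: string;
}

// Run a tool, append a record to the log whether it succeeds or fails.
async function tracedCall(
  log: ToolCallRecord[],
  tool: string,
  input: unknown,
  run: (input: unknown) => Promise<unknown>,
): Promise<unknown> {
  const startedAt = Date.now();
  try {
    const output = await run(input);
    log.push({ tool, input, output, duration_ms: Date.now() - startedAt });
    return output;
  } catch (thrown) {
    const error = thrown instanceof Error ? thrown.message : String(thrown);
    log.push({ tool, input, output: null, duration_ms: Date.now() - startedAt, error });
    throw thrown; // recording must not hide the failure from the caller
  }
}

// Length of the longest run of consecutive calls with the same tool and identical input.
function longestRepeatRun(log: ToolCallRecord[]): number {
  let longest = 0;
  let current = 0;
  let previous: string | null = null;
  for (const call of log) {
    const signature = `${call.tool}:${JSON.stringify(call.input)}`;
    current = signature === previous ? current + 1 : 1;
    longest = Math.max(longest, current);
    previous = signature;
  }
  return longest;
}
```

A threshold on `longestRepeatRun` (three identical calls in a row, say) is one cheap signal for alerting on stuck agents; the right threshold depends on the tools involved.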

Agents are powerful when the tools are well-designed, the loops are bounded, and the failures are observable. The models are capable enough — the engineering challenge is building the scaffolding that makes them reliable at scale. Start with a single agent and well-scoped tools. Add multi-agent orchestration only when you've hit a ceiling on context or complexity. And build observability in from day one, not when you're already debugging production.