Why I log every tool call now
Three quarters of my incident-debugging time used to come from not knowing what the AI did. Logging every tool call into a queryable store changed that: a small infrastructure investment, an outsized return.
A specific habit that’s emerged over the past year: every tool call from any agent in my setup gets logged to a queryable store. Not just the ones that fail, and not just the ones that involve sensitive data. Every one. The infrastructure to do this is small. The operational return (when something goes wrong, when I need to debug, when I want to recalibrate) has been outsized.
Worth being explicit about the practice because it’s the kind of small investment that produces compounding value the longer it runs.
What gets logged
For every tool call the agent makes:
- Conversation ID and turn number. The conversation is the atomic unit, as I argued earlier; this locates the call in the context it lives in.
- Agent identity. Which agent made the call, which version, which deployment.
- Tool name and arguments. The complete structured request.
- Tool response. The complete structured response, including success/failure flags and any error details.
- Latency. How long the call took end-to-end.
- Cost attribution. Token costs (for model-class tool calls) or estimated infrastructure cost (for action tool calls).
- Decision context. The reasoning the agent provided for why it was making this call (where the platform exposes it).
- User context. Which user the conversation is on behalf of, what scope they have.
That’s the record. Structured JSON, indexed by conversation ID, queryable by all the relevant axes. Small Postgres database. Maybe a thousand lines of code in the gateway that captures it.
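To make the shape concrete, here’s a sketch of a single record; every field name is illustrative, not a canonical schema:

```python
# One tool-call record as it might land in the store.
# Field names here are my sketch, not a fixed schema.
tool_call_record = {
    "conversation_id": "c_7f3a9",           # the atomic unit the call lives in
    "turn": 4,
    "agent": {"id": "support-agent", "version": "2.3.1", "deployment": "prod"},
    "tool_name": "crm.lookup_customer",
    "arguments": {"customer_id": "12345"},   # redacted at write time if sensitive
    "response": {"ok": True, "error": None, "body": {"status": "active"}},
    "latency_ms": 412,
    "cost": {"kind": "action", "estimated_usd": 0.0004},
    "decision_context": "User asked about order status; looking up the account first.",
    "user": {"id": "u_991", "scope": "read:orders"},
    "ts": "2025-06-14T02:13:07Z",
}
```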
What it solves
A few categories of problem that stop being painful once this habit is in place:
Incident debugging. When something goes wrong, the first question is “what did the agent actually do.” Without logs, this is detective work. With logs, it’s a query. The MTTR on AI-related incidents in my setup dropped meaningfully once this was in place.
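As a sketch of what that query looks like, assuming a `tool_calls` table shaped like the schema sketched in the infrastructure section below:

```python
# "What did the agent actually do" as a query rather than detective work.
# Table and column names assume the illustrative schema sketched later on.
import psycopg

with psycopg.connect("dbname=toolcalls") as conn:
    rows = conn.execute(
        """
        SELECT turn, tool_name, arguments, response, latency_ms
        FROM tool_calls
        WHERE conversation_id = %s
        ORDER BY turn, ts
        """,
        ("c_7f3a9",),
    ).fetchall()
    for row in rows:
        print(row)
```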
The hallucinated-existence failure mode. When an agent describes calling a tool that doesn’t exist or returning data that wasn’t actually returned, the logs are the source of truth. The agent’s narrative goes one way; the logs go the other; the logs win.
Calibration over time. Looking at the log over months reveals patterns the in-the-moment view doesn’t. Which tools the agent over-uses and which ones it under-uses. Which arguments produce errors, and which sequences of tool calls predict bad outcomes.
Audit and compliance. When the auditor asks “what did your AI system do over the past quarter,” the answer is a query rather than a scramble. The SOC 2 conversation gets meaningfully easier with proper logs.
Cost surprises. When the bill is higher than expected, the logs explain why. Which conversations spiked. Which tool calls were the expensive ones. Which agents are running away with the spend.
Trust calibration. When deciding whether to extend an agent’s autonomy or constrain it, the historical record of its tool-use patterns is the right input. Without the record, the decision is gut-feel.
These all happen in any non-trivial AI deployment. The logs make each one tractable.
The infrastructure
Concrete shape of the logging setup that’s been working:
A small Postgres table with the structured log records. One row per tool call. Indexed by (conversation_id), (agent_id, timestamp), and (tool_name, timestamp).
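A minimal sketch of that table and its indexes; the column names are illustrative:

```python
# DDL for the tool-call log, matching the three indexes described above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tool_calls (
    id               BIGSERIAL PRIMARY KEY,
    ts               TIMESTAMPTZ NOT NULL DEFAULT now(),
    conversation_id  TEXT NOT NULL,
    turn             INT NOT NULL,
    agent_id         TEXT NOT NULL,
    agent_version    TEXT,
    tool_name        TEXT NOT NULL,
    arguments        JSONB NOT NULL,   -- already redacted at write time
    response         JSONB NOT NULL,   -- includes ok/error flags
    latency_ms       INT,
    cost_usd         NUMERIC,
    decision_context TEXT,
    user_id          TEXT
);
CREATE INDEX IF NOT EXISTS tool_calls_by_conv  ON tool_calls (conversation_id);
CREATE INDEX IF NOT EXISTS tool_calls_by_agent ON tool_calls (agent_id, ts);
CREATE INDEX IF NOT EXISTS tool_calls_by_tool  ON tool_calls (tool_name, ts);
"""
```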
An interceptor in the gateway layer that captures every tool call and response before passing them through to the agent. The interceptor is the place to do this; trying to capture from the agent code is more fragile.
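In sketch form, assuming the gateway exposes a dispatcher you can wrap (`call_tool` and `write_log_row` are stand-ins here, not any specific framework’s API):

```python
import time

def with_logging(call_tool, write_log_row):
    """Wrap the gateway's tool dispatcher so every call is recorded,
    success or failure. `call_tool` and `write_log_row` are stand-ins
    for whatever your gateway actually exposes."""
    def wrapped(ctx, tool_name, arguments):
        start = time.monotonic()
        ok, response = True, None
        try:
            response = call_tool(tool_name, arguments)
            return response
        except Exception as exc:
            ok, response = False, {"error": str(exc)}
            raise
        finally:
            # Logging lives in finally so failed calls are captured too.
            write_log_row({
                "conversation_id": ctx["conversation_id"],
                "turn": ctx["turn"],
                "agent_id": ctx["agent_id"],
                "user_id": ctx["user_id"],
                "tool_name": tool_name,
                "arguments": redact(arguments),  # redaction pass, sketched below
                "response": {"ok": ok, "body": response},
                "latency_ms": int((time.monotonic() - start) * 1000),
            })
    return wrapped
```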
A retention policy. Logs persist for a defined period (in my case, 18 months) then get archived. Old logs are still accessible but not in the hot index.
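One way to express that policy in Postgres, assuming a `tool_calls_archive` table (hypothetical) with the same shape as the hot table:

```python
# Move rows past the retention window out of the hot table in one statement.
ARCHIVE_OLD_ROWS = """
WITH moved AS (
    DELETE FROM tool_calls
    WHERE ts < now() - interval '18 months'
    RETURNING *
)
INSERT INTO tool_calls_archive
SELECT * FROM moved;
"""
```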
A small query interface. A Grafana dashboard for the common views, a CLI for ad-hoc queries, or both; each serves a different usage pattern.
A redaction pass on sensitive arguments. When the tool arguments contain PII or credentials, the logged version has them redacted. The original goes nowhere.
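A minimal sketch of that pass, assuming sensitive fields can be identified by key name (a real deployment may want pattern-based detection as well):

```python
# Recursively scrub sensitive values before the record is written.
# The key list is illustrative; extend it for your own tools' argument shapes.
SENSITIVE_KEYS = {"password", "token", "api_key", "ssn", "credit_card", "email"}

def redact(value):
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value
```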
That’s the setup. Maybe a week of platform-engineering work to build cleanly. Once built, it runs without attention.
The patterns the logs reveal
A few things looking at six months of logs has shown me about how my agents actually behave:
Tool-call sequences cluster. Most user requests trigger one of maybe 8-10 distinct tool-call sequences, with little variation within each cluster. The “agents do anything they need to” mental model is overstated; they mostly do the same things in the same order.
Errors concentrate in a few tools. Out of the 30+ tools agents have access to in my setup, four account for 70% of the tool-call errors. Targeted reliability investment in those four reduced the error rate meaningfully.
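The query behind that observation is simple; a sketch against the schema above, assuming the response JSONB carries the `ok` flag:

```python
# Rank tools by error count over the last 30 days.
TOP_ERROR_TOOLS = """
SELECT tool_name,
       count(*) FILTER (WHERE (response->>'ok')::boolean IS FALSE) AS errors,
       count(*) AS calls
FROM tool_calls
WHERE ts > now() - interval '30 days'
GROUP BY tool_name
ORDER BY errors DESC
LIMIT 10;
"""
```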
Agent behavior drifts over time. The error rate on the same workloads tracks model updates and prompt changes. The logs make the drift visible; without them, the drift is invisible.
Some agents over-use specific tools. A tool that’s right for some situations gets used in many situations where it isn’t quite right. The pattern is visible in the logs and addressable in the agent prompts.
Cost outliers are predictable. A small number of conversation patterns (long-running agentic loops, certain domain queries that trigger many retrievals) account for the cost-tail outliers. Knowing which patterns lets you decide whether to constrain them or accept the cost.
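Finding the tail is again one query; a sketch using the `cost_usd` column from the schema above:

```python
# The cost tail: which conversations accounted for the spend this week.
COST_OUTLIERS = """
SELECT conversation_id,
       sum(cost_usd) AS total_cost_usd,
       count(*)      AS tool_calls
FROM tool_calls
WHERE ts > now() - interval '7 days'
GROUP BY conversation_id
ORDER BY total_cost_usd DESC
LIMIT 20;
"""
```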
These are the kinds of insights that don’t happen without the logs. The logs make them queryable; the queries inform decisions.
What I’d recommend
For teams running production AI workloads:
- Capture every tool call. Not just errors, not just sensitive ones. Every one.
- Index by conversation ID. The query patterns that matter most all start there.
- Log in structured JSON. Free-text logs are only useful for human reading; structured logs are useful for everything.
- Build the retention story up front. The “let’s worry about retention later” path produces either out-of-control storage costs or accidental data loss.
- Redact at write time. Sensitive fields shouldn’t reach the log store in the first place.
- Provide a query interface. A dashboard, a CLI, or both; make the logs actually queryable rather than write-only.
The logging investment is small. The return compounds. Deployments that do this well have markedly better operational hygiene than those that don’t, and the teams that skip it usually spend a few painful incidents wishing they had the data, then build the logging anyway.
Worth doing before the painful incident. The pattern is well understood, the infrastructure is small, the value compounds. One of the most under-invested-in practices in production AI in 2025.
Log every tool call. The future-you who’s debugging at 2am will thank the present-you who set this up.