In traditional software, code is the source of truth. For AI agents, behavior is non-deterministic, driven by the black-box model. As a result, runtime traces—which show the agent's step-by-step context and decisions—become the essential artifact for debugging, testing, and collaboration, more so than the code itself.

Related Insights

A cutting-edge pattern involves AI agents using a CLI to pull their own runtime failure traces from monitoring tools like LangSmith. The agent can then analyze these traces to diagnose errors and modify its own codebase or instructions to prevent future failures, creating a powerful, human-supervised self-improvement loop.
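
A minimal sketch of the retrieval half of that loop, assuming the LangSmith Python SDK (`langsmith.Client.list_runs`) and an API key in the environment; the project name and the `diagnose_failure` step are hypothetical placeholders for the agent's own analysis:

```python
# Fetch recent failed runs so the agent (under human supervision) can inspect them.
# Assumes the `langsmith` SDK is installed and LANGSMITH_API_KEY is set.
from datetime import datetime, timedelta, timezone

from langsmith import Client

client = Client()

failed_runs = client.list_runs(
    project_name="my-agent-project",                       # hypothetical project
    error=True,                                            # only runs that errored
    start_time=datetime.now(timezone.utc) - timedelta(days=1),
)

for run in failed_runs:
    print(f"run {run.id}: {run.error}")
    # diagnose_failure(run.inputs, run.error)  # agent-side analysis/self-repair goes here
```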

AI interactions often involve multiple steps (e.g., user prompt, tool calls, retrieval). When an error occurs, the entire chain can fail. The most efficient debugging heuristic is to analyze the sequence and stop at the very first mistake. Focusing on this "most upstream problem" addresses the root cause, as downstream failures are merely symptoms.
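
The heuristic is mechanical enough to sketch in code. The trace format below (a list of step dicts with a `name` and an optional `error`) is an assumption; adapt it to whatever your logging actually captures:

```python
# "Stop at the first mistake": return the earliest failing step in a trace,
# since later failures are usually symptoms of that upstream problem.
from typing import Optional


def first_upstream_failure(trace: list[dict]) -> Optional[dict]:
    """Return the earliest step that went wrong, or None if the trace is clean."""
    for step in trace:
        if step.get("error") is not None:
            return step
    return None


trace = [
    {"name": "user_prompt", "error": None},
    {"name": "retrieval", "error": "empty result set"},        # root cause
    {"name": "tool_call", "error": "KeyError: 'documents'"},   # downstream symptom
]

culprit = first_upstream_failure(trace)
print(f"Debug here first: {culprit['name']} -> {culprit['error']}")
```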

Traditional software relies on predictable, deterministic functions. AI agents introduce a new paradigm of "stochastic subroutines," where responsibility for correctness and control flow is handed off to a probabilistic model rather than guaranteed by the code. This means developers must design systems that can achieve reliable outcomes despite the non-deterministic paths the AI might take to get there.
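
One common way to get a reliable outcome from a stochastic subroutine is to pair it with a deterministic validation check and retry on failure. This is a sketch, not the only pattern; `call_model` and `is_valid_json_plan` are hypothetical placeholders:

```python
# Wrap a non-deterministic call in a deterministic check-and-retry loop.
import json


def call_model(prompt: str) -> str:
    """Placeholder for a non-deterministic LLM call."""
    raise NotImplementedError


def is_valid_json_plan(text: str) -> bool:
    """Deterministic check the caller actually cares about."""
    try:
        plan = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(plan, dict) and "steps" in plan


def reliable_plan(prompt: str, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        output = call_model(prompt)
        if is_valid_json_plan(output):
            return json.loads(output)
    raise RuntimeError(f"No valid plan after {max_attempts} attempts")
```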

Rather than programming AI agents with a company's formal policies, a more powerful approach is to let them observe thousands of actual 'decision traces.' This allows the AI to discover the organization's emergent, de facto rules—how work *actually* gets done—creating a more accurate and effective world model for automation.

The effectiveness of enterprise AI agents is limited not by data access, but by the absence of context for *why* decisions were made. 'Context graphs' aim to solve this by capturing 'decision traces'—exceptions, precedents, and overrides that currently live in Slack threads and employees' heads—creating a true source of truth for automation.
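
To make 'decision trace' concrete, here is a sketch of what a single record might capture before it is linked into a context graph. The field names are assumptions, not a standard schema:

```python
# One decision trace: the exception/override context that otherwise lives in Slack.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DecisionTrace:
    decision: str                 # what was decided ("waived late fee")
    actor: str                    # who decided ("jane@acme.com")
    rationale: str                # why ("long-standing customer, first incident")
    policy_exception: bool        # did it override the written policy?
    precedents: list[str] = field(default_factory=list)  # ids of related decisions
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```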

Treating AI evaluation like a final exam is a mistake. For critical enterprise systems, evaluations should be embedded at every step of an agent's workflow (e.g., after planning, before action). This is akin to unit testing in classic software development and is essential for building trustworthy, production-ready agents.
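
A sketch of what step-level evals look like inside an agent loop, rather than as a single end-of-task exam. `generate_plan`, `execute_step`, and the two check functions are hypothetical placeholders:

```python
# Embed cheap, unit-test-style checks after planning and before each action.
def generate_plan(task: str) -> list[str]:
    """Placeholder for the (stochastic) planner."""
    raise NotImplementedError


def execute_step(action: str) -> None:
    """Placeholder for the tool executor."""
    raise NotImplementedError


def check_plan(plan: list[str]) -> None:
    """Eval after planning: structural assertions, like a unit test."""
    assert plan, "planner returned an empty plan"
    assert len(plan) <= 10, "plan is suspiciously long"


def check_action(action: str, allowed_tools: set[str]) -> None:
    """Eval before acting: block anything outside the allowed tool set."""
    tool = action.split(":", 1)[0]
    assert tool in allowed_tools, f"unexpected tool requested: {tool}"


def run_agent(task: str, allowed_tools: set[str]) -> None:
    plan = generate_plan(task)                   # plan
    check_plan(plan)                             # eval #1: after planning
    for action in plan:
        check_action(action, allowed_tools)      # eval #2: before each action
        execute_step(action)                     # act
```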

OpenAI identifies agent evaluation as a key challenge. While they can currently grade an entire task's trace, the real difficulty lies in evaluating and optimizing the individual steps within a long, complex agentic workflow. This is a work-in-progress area critical for building reliable, production-grade agents.

You don't need a sophisticated and expensive AI observability platform to start doing evals. The most critical first step is logging traces. This can be done simply by writing to a CSV, JSON, or text file. The key is to begin taking notes on your traces, not to implement the perfect tool.
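
The simplest version is a few lines of Python appending one JSON object per step to a local `.jsonl` file. The field names here are arbitrary; start with whatever you already have on hand:

```python
# Minimal trace logging: one JSON record per step, plus room for your notes.
import json
from datetime import datetime, timezone


def log_step(path: str, step_name: str, inputs: dict, output: str, note: str = "") -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "step": step_name,
        "inputs": inputs,
        "output": output,
        "note": note,  # your hand-written observation about this trace
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


log_step("traces.jsonl", "summarize", {"doc_id": "42"}, "…", note="missed the key caveat")
```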

Treat accountability as an engineering problem. Implement a system that logs every significant AI action, decision path, and triggering input. This creates an auditable, attributable record, ensuring that in the event of an incident, the 'why' can be traced without ambiguity, much like a flight recorder after a crash.
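
One way to make that logging automatic is to wrap each tool the agent can call, so the actor, triggering input, and result are recorded as an append-only, attributable entry. The decorator, schema, and example tool below are assumptions, not a prescribed design:

```python
# "Flight recorder" plumbing: every wrapped tool call is written to an audit log.
import functools
import json
from datetime import datetime, timezone

AUDIT_LOG = "audit.jsonl"


def audited(agent_id: str):
    def decorator(tool_fn):
        @functools.wraps(tool_fn)
        def wrapper(*args, **kwargs):
            result = tool_fn(*args, **kwargs)
            entry = {
                "ts": datetime.now(timezone.utc).isoformat(),
                "agent": agent_id,               # who (which agent/version) acted
                "action": tool_fn.__name__,      # the externally visible effect
                "trigger": {                     # the input that set things in motion
                    "args": [repr(a) for a in args],
                    "kwargs": {k: repr(v) for k, v in kwargs.items()},
                },
                "result": repr(result),
            }
            with open(AUDIT_LOG, "a", encoding="utf-8") as f:
                f.write(json.dumps(entry) + "\n")
            return result
        return wrapper
    return decorator


@audited(agent_id="billing-agent-v2")
def refund(order_id: str, amount: float) -> str:
    return f"refunded {amount} on {order_id}"
```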

Historically, developer tools adapted to a company's codebase. The productivity gains from AI agents are so significant that the dynamic has flipped: for the first time, companies are proactively changing their code, logging, and tooling to be more 'agent-friendly,' rather than the other way around.