Evaluating Multi-Step Agentic Traces is a Major Unsolved Problem in AI

OpenAI identifies agent evaluation as a key challenge. While an entire task's trace can already be graded end to end, the real difficulty lies in evaluating and optimizing the individual steps within a long, complex agentic workflow. This remains a work in progress and is critical for building reliable, production-grade agents.

Related Insights

Training AI agents to execute multi-step business workflows demands a new data paradigm. Companies create reinforcement learning (RL) environments—mini world models of business processes—where agents learn by attempting tasks, a more advanced method than simple prompt-completion training (SFT/RLHF).
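A minimal sketch of what such an environment could look like, loosely following a Gymnasium-style reset()/step() convention; the invoice-approval task, states, and reward shaping are invented for illustration, not any company's actual setup:

```python
# Sketch of an RL environment ("mini world model") for a business workflow.
# The task and rewards are hypothetical; the interface loosely follows
# the Gymnasium reset()/step() convention.

class InvoiceApprovalEnv:
    """The agent must route an invoice through an approval workflow."""

    STEPS = ["extract_fields", "validate_vendor", "check_budget", "approve"]

    def reset(self):
        self.cursor = 0
        return self._obs()

    def step(self, action):
        expected = self.STEPS[self.cursor]
        if action == expected:
            self.cursor += 1
            done = self.cursor == len(self.STEPS)
            # Reward each correct step, with a bonus for finishing the task.
            return self._obs(), 1.0 + (5.0 if done else 0.0), done
        # A wrong action ends the episode with a penalty: the agent learns
        # from failed attempts rather than from labeled prompt-completions.
        return self._obs(), -1.0, True

    def _obs(self):
        step = self.STEPS[self.cursor] if self.cursor < len(self.STEPS) else "done"
        return {"step": step, "invoice": {"amount": 1200}}
```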

Treating AI evaluation like a final exam is a mistake. For critical enterprise systems, evaluations should be embedded at every step of an agent's workflow (e.g., after planning, before action). This is akin to unit testing in classic software development and is essential for building trustworthy, production-ready agents.
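A sketch of the idea, with hypothetical check_plan and check_action gates standing in for real step-level evals:

```python
# Step-level evaluation gates, analogous to unit tests in classic software.
# check_plan, check_action, and the agent interface are illustrative stand-ins.

def check_plan(plan: list[str]) -> None:
    assert plan, "agent produced an empty plan"
    assert len(plan) <= 10, "plan is suspiciously long; likely off track"

def check_action(action: dict) -> None:
    assert action["tool"] in {"search", "write_file"}, "unknown tool requested"

def run_agent(task, agent):
    plan = agent.plan(task)
    check_plan(plan)              # eval after planning, before any action
    for step in plan:
        action = agent.next_action(step)
        check_action(action)      # eval before each action executes
        agent.execute(action)
```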

Instead of manually refining a complex prompt, create a process where an AI agent evaluates its own output. By providing a framework for self-critique, including quantitative scores and qualitative reasoning, the AI can iteratively enhance its own system instructions and achieve a much stronger result.
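One possible shape for such a loop; llm() is a stand-in for any chat-completion call, and the JSON rubric and score threshold are illustrative assumptions:

```python
# Sketch of a self-critique loop that iteratively rewrites a system prompt.
# llm() stands in for any text-generation call; rubric and threshold are invented.

import json

def improve_prompt(llm, system_prompt, task, rounds=3, target=8):
    for _ in range(rounds):
        output = llm(system=system_prompt, user=task)
        critique = llm(
            system="You are a strict reviewer. Return JSON with "
                   "'score' (1-10) and 'reasoning'.",
            user=f"Task: {task}\n\nOutput: {output}",
        )
        review = json.loads(critique)
        if review["score"] >= target:
            break
        # Feed the quantitative score and qualitative reasoning back so
        # the model can revise its own instructions.
        system_prompt = llm(
            system="Rewrite the system prompt to fix the weaknesses below.",
            user=f"Prompt: {system_prompt}\nCritique: {review['reasoning']}",
        )
    return system_prompt
```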

Beyond supervised fine-tuning (SFT) and human feedback (RLHF), reinforcement learning (RL) in simulated environments is the next evolution. These "playgrounds" teach models to handle messy, multi-step, real-world tasks where current models often fail catastrophically.

The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.
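A sketch of an eval written as data, where each case reads like a PRD line item stating what success looks like; the cases and grading logic are invented for illustration:

```python
# An eval as a spec: each case declares the success criteria, and the
# pass rate becomes the capability metric. Grader is deliberately simple.

EVAL_CASES = [
    {"input": "Refund a duplicate charge of $42.17",
     "must_contain": ["refund", "42.17"]},
    {"input": "Summarize the Q3 report in one sentence",
     "max_sentences": 1},
]

def grade(case, output: str) -> bool:
    if any(kw not in output.lower() for kw in case.get("must_contain", [])):
        return False
    max_s = case.get("max_sentences")
    if max_s is not None and output.count(".") > max_s:  # crude sentence count
        return False
    return True

def run_eval(model):
    passed = sum(grade(c, model(c["input"])) for c in EVAL_CASES)
    return passed / len(EVAL_CASES)  # pass rate = the measured capability
```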

To improve the quality and accuracy of an AI agent's output, spawn multiple sub-agents with competing or adversarial roles. For example, a code review agent finds bugs, while several "auditor" agents check for false positives, resulting in a more reliable final analysis.
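A minimal sketch of the pattern, with reviewer() and auditor() as stand-ins for separate model calls and a simple quorum vote:

```python
# Adversarial sub-agents: one reviewer proposes findings, several auditors
# vote on whether each finding is a false positive. All names are stand-ins.

def audited_review(code: str, reviewer, auditors, quorum=2):
    findings = reviewer(code)        # e.g. a list of suspected bugs
    confirmed = []
    for finding in findings:
        votes = sum(1 for audit in auditors if audit(code, finding))
        if votes >= quorum:          # keep only findings the auditors confirm
            confirmed.append(finding)
    return confirmed
```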

Borrowing from classic management theory, the most effective way to use AI agents is to fix problems at the lowest-value stage, i.e., as early as possible. This means rigorously reviewing the agent's proposed plan *before* it writes any code, preventing costly rework later on.
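A sketch of such a gate; approve() could be a human reviewer or a stronger judge model, and all names are illustrative:

```python
# A review gate at the lowest-value stage: the plan is inspected and
# approved before any code is generated.

def gated_run(agent, task, approve):
    plan = agent.plan(task)
    ok, feedback = approve(plan)      # cheap to reject a plan here...
    if not ok:
        plan = agent.replan(task, feedback=feedback)
    return agent.write_code(plan)     # ...costly to rework finished code
```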

Replit's leap in AI agent autonomy isn't from a single superior model, but from orchestrating multiple specialized agents using models from various providers. This multi-agent approach creates a different, faster scaling paradigm for task completion compared to single-model evaluations, suggesting a new direction for agent research.

While AI models excel at gathering and synthesizing information ('knowing'), they are not yet reliable at executing actions in the real world ('doing'). True agentic systems require bridging this gap by adding crucial layers of validation and human intervention to ensure tasks are performed correctly and safely.
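A sketch of that validation layer; execute, verify, and ask_human are stand-ins for real integrations:

```python
# Bridging 'knowing' and 'doing': every action is verified against the
# real world after execution, and unverifiable actions escalate to a human.

def act_with_validation(action, execute, verify, ask_human):
    result = execute(action)          # the 'doing' step
    if verify(action, result):        # did the world change as intended?
        return result
    return ask_human(action, result)  # fall back to human intervention
```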

The primary obstacle to creating a fully autonomous AI software engineer isn't just model intelligence but "controlling entropy." This refers to the challenge of preventing the compounding accumulation of small, 1% errors that eventually derail a complex, multi-step task and get the agent irretrievably off track.
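A quick back-of-envelope calculation shows why these small errors are fatal at agentic horizons:

```python
# If each step succeeds with probability 0.99, the chance of finishing
# an n-step task without derailing is 0.99**n.
for n in (10, 100, 500):
    print(n, round(0.99 ** n, 3))
# 10  0.904
# 100 0.366
# 500 0.007
```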
