In traditional software, code is the source of truth. For AI agents, behavior is non-deterministic, driven by the black-box model. As a result, runtime traces—which show the agent's step-by-step context and decisions—become the essential artifact for debugging, testing, and collaboration, more so than the code itself.
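To make this concrete, here is a minimal sketch of what a step-level trace record might capture. The field names are illustrative, not any particular tracing schema; real tools like LangSmith or OpenTelemetry define much richer formats.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TraceStep:
    step: int               # position in the agent's run
    context: list[str]      # messages/documents visible to the model at this step
    decision: str           # the action the model chose (tool call, reply, ...)
    tool_output: Any = None # what came back, if a tool was invoked

@dataclass
class AgentTrace:
    run_id: str
    steps: list[TraceStep] = field(default_factory=list)

    def replay(self) -> None:
        """Print the step-by-step record a human would debug against."""
        for s in self.steps:
            print(f"[{s.step}] saw {len(s.context)} context items -> {s.decision}")
```

Because the model is a black box, a record like this (rather than the agent's code) is what reveals why a given run succeeded or failed.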
"Context Engineering" is the critical practice of managing information fed to an LLM, especially in multi-step agents. This includes techniques like context compaction, using sub-agents, and managing memory. Harrison Chase considers this discipline more crucial than prompt engineering for building sophisticated agents.
According to Harrison Chase, providing agents with file system access is critical for long-horizon tasks. It serves as a powerful context management tool, allowing the agent to save large tool outputs or conversation histories to files, then retrieve them as needed, effectively bypassing context window limitations.
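A minimal sketch of this offloading pattern, assuming two hypothetical file tools exposed to the agent: one saves a large output to disk and hands the model a short pointer plus a preview, and the other pages a slice back into context on demand.

```python
from pathlib import Path

WORKSPACE = Path("agent_workspace")  # hypothetical scratch directory for the agent
WORKSPACE.mkdir(exist_ok=True)

def offload(name: str, content: str, preview_chars: int = 200) -> str:
    """Save a large tool output to disk; return a short pointer instead of the payload."""
    path = WORKSPACE / name
    path.write_text(content)
    return f"Saved {len(content)} chars to {path}. Preview: {content[:preview_chars]}"

def retrieve(name: str, start: int = 0, length: int = 2000) -> str:
    """Let the agent page a slice of a saved file back into context when needed."""
    return (WORKSPACE / name).read_text()[start:start + length]
```

The context window only ever holds the pointer and whatever slice the agent chooses to re-read, so the effective working set can far exceed the window itself.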
Early agent development used simple frameworks ("scaffolds") to structure model interactions. As LLMs grew more capable, the industry moved to "harnesses"—more opinionated, "batteries-included" systems that provide default tools (like planning and file systems) and handle complex tasks like context compaction automatically.
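The contrast can be sketched as two interfaces. This is illustrative, not any specific framework's API; the tool stubs stand in for real planning, file, and compaction implementations.

```python
def plan(goal: str) -> str:                   # default planning tool (stub)
    return f"plan for: {goal}"

def read_file(path: str) -> str:              # default file tool (stub)
    return open(path).read()

def compact_context(messages: list[str]) -> list[str]:  # auto-compaction hook (stub)
    return messages[-20:]

class Scaffold:
    """Thin loop: the caller supplies every tool and prompt."""
    def __init__(self, model, tools, prompt):
        self.model, self.tools, self.prompt = model, tools, prompt

class Harness:
    """Opinionated defaults: planning and file tools, plus automatic compaction."""
    def __init__(self, model, extra_tools=()):
        self.model = model
        self.tools = [plan, read_file, *extra_tools]  # batteries included
        self.pre_model_hooks = [compact_context]       # runs before each model call
```

The shift tracks model capability: weaker models needed bespoke scaffolding per task, while stronger models do more with a well-stocked default toolbox.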
While foundation model companies build effective agent harnesses, they don't necessarily dominate. Independent startups focused on coding agents often top public benchmarks (e.g., Terminal Bench 2). This demonstrates that harness engineering is a specialized skill separate from and not exclusive to model creation.
Long-horizon agents are not yet reliable enough for full autonomy. Their most effective current use cases involve generating a "first draft" of a complex work product, like a code pull request or a financial report. This leverages their ability to perform extensive work while keeping a human in the loop for final validation and quality control.
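A minimal sketch of this first-draft pattern, where `generate_draft` stands in for a long-horizon agent run and the human gates what actually ships:

```python
def generate_draft(task: str) -> str:
    # Placeholder for hours of autonomous agent work (e.g., drafting a PR or report).
    return f"draft output for: {task}"

def review_loop(task: str) -> str:
    """Agent does the heavy lifting; a human approves, edits, or rejects the result."""
    draft = generate_draft(task)
    print(draft)
    verdict = input("approve / edit / reject? ").strip().lower()
    if verdict == "approve":
        return draft
    if verdict == "edit":
        return input("paste corrected version: ")
    raise RuntimeError("draft rejected; re-run with revised instructions")
```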
Traditional software development iterates on a known product based on user feedback. In contrast, agent development is more fundamentally iterative because you don't fully know an agent's capabilities or failure modes until you ship it. The initial goal of iteration is simply to understand and shape what the agent *does*.
A cutting-edge pattern involves AI agents using a CLI to pull their own runtime failure traces from monitoring tools like LangSmith. The agent can then analyze these traces to diagnose errors and modify its own codebase or instructions to prevent future failures, creating a powerful, human-supervised self-improvement loop.
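A sketch of the loop's first step, assuming the LangSmith Python SDK (`pip install langsmith`, with `LANGSMITH_API_KEY` set). The project name is hypothetical; an agent could invoke something like this through a CLI tool, then propose fixes for a human to review.

```python
from langsmith import Client

client = Client()  # reads the API key from the environment

def recent_failures(project: str = "my-agent", limit: int = 10) -> list[dict]:
    """Pull recent errored runs so the agent (or a human) can diagnose them."""
    runs = client.list_runs(project_name=project, error=True, limit=limit)
    return [
        {"id": str(r.id), "name": r.name, "error": r.error, "inputs": r.inputs}
        for r in runs
    ]
```

The key point is that failure data is itself machine-readable context: the same traces a human would inspect can be fed back to the agent as input for diagnosis.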
Long-horizon agents, which can run for hours or days, require a dual-mode UI. Users need an asynchronous way to manage multiple running agents (like a Jira board or inbox). However, they also need to seamlessly switch to a synchronous chat interface to provide real-time feedback or corrections when an agent pauses or finishes.
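An illustrative data model for that dual-mode UI (not any product's actual schema): an async inbox of runs, plus a synchronous channel into any run that has paused for feedback.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    RUNNING = "running"
    NEEDS_INPUT = "needs_input"  # paused; waiting on a human
    DONE = "done"

@dataclass
class AgentRun:
    run_id: str
    task: str
    status: Status = Status.RUNNING
    messages: list[str] = field(default_factory=list)

def inbox(runs: list[AgentRun]) -> list[AgentRun]:
    """Async view: surface the runs that need a human first."""
    return sorted(runs, key=lambda r: r.status != Status.NEEDS_INPUT)

def reply(run: AgentRun, feedback: str) -> None:
    """Sync view: drop into chat with a paused run and resume it."""
    run.messages.append(feedback)
    run.status = Status.RUNNING
```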
