AI models struggle to create and adhere to multi-step, long-term plans. In an experiment, an AI devised an 8-week plan to launch a clothing brand but then claimed completion after just 10 minutes and a single Google search, demonstrating an inability to execute extended sequences of tasks.

Related Insights

While AI can attempt complex, hour-long tasks with 50% success, its reliability plummets for longer operations. For mission-critical enterprise use requiring 99.9% success, current AI can only reliably complete tasks taking about three seconds. This necessitates breaking large problems into many small, reliable micro-tasks.
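
The arithmetic behind that decomposition can be sketched roughly as follows. This is an illustration only, under a simplifying assumption that is not in the source: every micro-task succeeds independently and with the same probability. The function names `chain_success` and `required_per_step` are hypothetical.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds.

    Assumes (for illustration) that steps fail independently.
    """
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed for the whole chain to reach `target`."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # How reliable must each micro-task be for a 99.9% end-to-end result?
    for steps in (1, 10, 100, 1000):
        needed = required_per_step(0.999, steps)
        print(f"{steps:>5} steps -> each step must succeed {needed:.6%} of the time")
```

Under these assumptions the per-step bar rises as the chain gets longer, so decomposition pays off only when each micro-task is short enough to be highly reliable.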

Unlike simple chatbots, AI agents tackle complex requests by first creating a detailed, transparent plan. The agent can even adapt this plan mid-process based on initial findings, demonstrating a more autonomous approach to problem-solving.

AI is not a 'set and forget' solution. An agent's effectiveness directly correlates with the amount of time humans invest in training, iteration, and providing fresh context. Performance will ebb and flow with human oversight, with the best results coming from consistent, hands-on management.

AI struggles with long-horizon tasks not only because of technical limits but also because we lack good ways to measure performance on them. Once effective evaluations (evals) for these capabilities exist, researchers can optimize models against them quickly, which should accelerate progress significantly.

Despite marketing hype, current AI agents are not fully autonomous and cannot replace an entire human job. They excel at executing a sequence of defined tasks to achieve a specific goal, like research, but lack the complex reasoning for broader job functions. True job replacement is likely still years away.

Karpathy argues against the hype of an imminent "year of agents." He believes that while impressive, current AI agents have significant cognitive deficits. Achieving the reliability of a human intern will require a decade of sustained research to solve fundamental problems like continual learning and multimodality.

OpenAI identifies agent evaluation as a key challenge. While they can currently grade an entire task's trace, the real difficulty lies in evaluating and optimizing the individual steps within a long, complex agentic workflow. This is a work-in-progress area critical for building reliable, production-grade agents.

Many AI projects become expensive experiments because companies treat AI as a trendy add-on to existing systems rather than fundamentally re-evaluating the underlying business processes and organizational readiness. This leads to issues like hallucinations and incomplete tasks, turning potential assets into costly failures.

While AI models excel at gathering and synthesizing information ('knowing'), they are not yet reliable at executing actions in the real world ('doing'). True agentic systems require bridging this gap by adding crucial layers of validation and human intervention to ensure tasks are performed correctly and safely.
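
One possible shape for such a validation layer is sketched below. It is a minimal illustration, not a description of any particular product: the names `run_with_validation` and `Outcome`, the retry limit, and the "escalated" status that stands in for handing the task to a person are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    result: Optional[str]  # the accepted output, or None if escalated
    status: str            # "accepted" or "escalated"
    attempts: int          # how many times the action was run


def run_with_validation(
    action: Callable[[], str],
    validate: Callable[[str], bool],
    max_attempts: int = 3,
) -> Outcome:
    """Run an agent action, accept its output only if a validator approves it,
    and escalate to a human when every attempt fails validation."""
    for attempt in range(1, max_attempts + 1):
        candidate = action()
        if validate(candidate):
            return Outcome(candidate, "accepted", attempt)
    # No attempt passed the check: hand the task over instead of guessing.
    return Outcome(None, "escalated", max_attempts)


if __name__ == "__main__":
    # Hypothetical action that fails once, then produces a usable result.
    outputs = iter(["", "order #123 placed"])
    outcome = run_with_validation(lambda: next(outputs), lambda s: s.startswith("order"))
    print(outcome)
```

The design choice here is that the agent never reports success on its own authority: a result counts only after an independent check, and anything that cannot be validated goes to a human.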

Current AI world models suffer from compounding errors in long-term planning, where small inaccuracies become catastrophic over many steps. Demis Hassabis suggests that hierarchical planning, which operates at different levels of temporal abstraction, is a promising way to mitigate this because it reduces the number of sequential steps.
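
A toy calculation can make the compounding effect concrete. The sketch below rests on assumptions that are not in the source: each prediction is correct independently with the same probability at every level, and each low-level rollout starts from a fresh observation rather than from earlier predictions. The names `rollout_accuracy` and `compare_plans` are hypothetical.

```python
def rollout_accuracy(per_step: float, steps: int) -> float:
    """Chance that a rollout stays correct across `steps` sequential predictions
    (independent errors assumed for illustration)."""
    return per_step ** steps


def compare_plans(per_step: float, abstract_steps: int, primitive_per_abstract: int) -> dict:
    """Compare one flat rollout with the two shorter rollouts of a two-level plan."""
    total = abstract_steps * primitive_per_abstract
    return {
        # One long chain over every primitive step.
        "flat_rollout": rollout_accuracy(per_step, total),
        # The high-level planner only predicts across abstract steps.
        "high_level_rollout": rollout_accuracy(per_step, abstract_steps),
        # Each abstract step is expanded into its own short primitive rollout.
        "low_level_rollout": rollout_accuracy(per_step, primitive_per_abstract),
    }


if __name__ == "__main__":
    for name, value in compare_plans(0.99, 20, 50).items():
        print(f"{name}: {value:.4f}")
```

In this simplified model the flat rollout compounds error over every primitive step, while each rollout in the two-level plan compounds over far fewer steps, which is the intuition behind planning at several levels of temporal abstraction.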