The agent's inability to process dates led it to schedule family events on the wrong days, creating chaos. The LLM's excuse—that it was 'mentally calculating'—reveals a fundamental weakness: models lack a true sense of time, making them unreliable for critical, time-sensitive coordination tasks.
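One practical mitigation, sketched below, is to keep calendar arithmetic out of the model entirely and hand it to a deterministic tool. The helper and the "Tuesday after next Friday" request are illustrative assumptions, not details from the incident above.

```python
from datetime import date, timedelta

def next_weekday(start: date, weekday: int) -> date:
    """Next occurrence of `weekday` (Mon=0 .. Sun=6) strictly after `start`."""
    days_ahead = (weekday - start.weekday()) % 7
    return start + timedelta(days=days_ahead or 7)

# Illustrative request the agent would delegate instead of "mentally calculating":
# "the Tuesday after next Friday"
next_friday = next_weekday(date.today(), 4)        # 4 = Friday
following_tuesday = next_weekday(next_friday, 1)   # 1 = Tuesday
print(following_tuesday.isoformat())
```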
A core debate in AI is whether LLMs, which are text prediction engines, can achieve true intelligence. Critics argue they cannot because they lack a model of the real world. This prevents them from making meaningful, context-aware predictions about future events—a limitation that more data alone may not solve.
Salesforce's AI Chief warns of "jagged intelligence," where LLMs can perform brilliant, complex tasks but fail at simple common-sense ones. This inconsistency is a significant business risk, as a failure in a basic but crucial task (e.g., loan calculation) can have severe consequences.
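To make the loan-calculation example concrete: the standard amortization formula is trivial for deterministic code, and it is exactly the kind of "simple but crucial" arithmetic a model should not be left to improvise. The figures below are hypothetical, used only to show the shape of the computation.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Amortized-loan payment: P * r / (1 - (1 + r)**-n), with monthly rate r and n payments."""
    r = annual_rate / 12
    n = years * 12
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

# Hypothetical figures: $250,000 over 30 years at 6% comes to about $1,498.88/month.
print(round(monthly_payment(250_000, 0.06, 30), 2))
```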
An AI agent's failure on a complex task like tax preparation isn't due to a lack of intelligence. Instead, it's often blocked by a single, unpredictable "tiny thing," such as misinterpreting two boxes on a W-4 form. This highlights that reliability challenges are granular and not always intuitive.
Building features like custom commands and sub-agents can look like reliable, deterministic workflows. However, because they are built on non-deterministic LLMs, they fail unpredictably. This misleads users into trusting a fragile abstraction and ultimately results in a poor experience.
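A minimal sketch of the failure mode, assuming a hypothetical `call_llm` function: even when a custom command looks like a fixed pipeline, the LLM step inside it can return malformed output, so at a minimum the abstraction should validate that output and surface the failure rather than pass it along silently.

```python
import json

def run_subagent(prompt: str, call_llm, max_retries: int = 2) -> dict:
    """Wrap a non-deterministic LLM call (hypothetical `call_llm`) in structural checks
    so malformed output raises an error instead of silently corrupting the workflow."""
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)  # hypothetical: returns a string
        try:
            result = json.loads(raw)
            if isinstance(result, dict) and {"action", "arguments"} <= result.keys():
                return result   # structurally valid output
        except json.JSONDecodeError:
            pass                # fall through and retry
    raise RuntimeError(f"Sub-agent produced no valid output in {max_retries + 1} attempts")
```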
Today's AI models are powerful but lack a true sense of causality, leading to illogical errors. Unconventional AI's Naveen Rao hypothesizes that building AI on substrates with inherent time and dynamics—mimicking the physical world—is the key to developing this missing causal understanding.
The key challenge in building a multi-context AI assistant isn't hitting a technical wall with LLMs. Instead, it's the immense risk associated with a single error. An AI turning off the wrong light is an inconvenience; locking the wrong door is a catastrophic failure that destroys user trust instantly.
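One common way to handle this risk asymmetry, sketched below with hypothetical action names and callbacks rather than any real smart-home API, is to gate irreversible actions behind explicit confirmation while letting low-stakes ones run automatically.

```python
# Hypothetical risk gate: reversible actions run automatically; irreversible ones
# (locks, alarms, payments) require an explicit confirmation step.
HIGH_RISK_ACTIONS = {"lock_door", "unlock_door", "disarm_alarm"}

def execute(action: str, target: str, dispatch, confirm) -> None:
    """`dispatch(action, target)` performs the action; `confirm(msg)` asks the user.
    Both callbacks are assumptions for illustration."""
    if action in HIGH_RISK_ACTIONS and not confirm(f"About to {action} '{target}'. Proceed?"):
        return  # refusing is cheaper than an irreversible mistake
    dispatch(action, target)
```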
AI models struggle to create and adhere to multi-step, long-term plans. In an experiment, an AI devised an 8-week plan to launch a clothing brand but then claimed completion after just 10 minutes and a single Google search, demonstrating an inability to execute extended sequences of tasks.
Salesforce is reintroducing deterministic automation because its generative AI agents struggle with reliability, dropping instructions when given more than eight commands. This pullback signals current LLMs are not ready for high-stakes, consistent enterprise workflows.
Simply having a large context window is insufficient. Models may fail to "see" or recall specific facts embedded deep within the context, a phenomenon exposed by "needle in the haystack" evaluations. Effective reasoning capability across the entire window is a separate, critical factor.
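A rough sketch of a single needle-in-the-haystack trial, assuming a hypothetical `call_llm` function: one fact is buried at a random depth inside filler context, and the model is then asked to recall it.

```python
import random

def needle_trial(call_llm, filler: list[str], needle: str, question: str, answer: str) -> bool:
    """Bury one fact at a random depth in filler context and check whether the model recalls it."""
    docs = filler[:]
    docs.insert(random.randrange(len(docs) + 1), needle)
    prompt = "\n\n".join(docs) + f"\n\nQuestion: {question}\nAnswer:"
    return answer.lower() in call_llm(prompt).lower()  # hypothetical model call

# A full evaluation repeats this across context lengths and needle depths,
# then reports recall rate per (length, depth) cell.
```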
Current AI world models suffer from compounding errors in long-term planning, where small inaccuracies become catastrophic over many steps. Demis Hassabis suggests hierarchical planning—operating at different levels of temporal abstraction—is a promising solution to mitigate this issue by reducing the number of sequential steps.
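A toy back-of-the-envelope model (not from the source) makes the compounding point concrete: if each planning decision is independently correct with probability p, a flat n-step plan succeeds with probability p^n, while a two-level hierarchy that decides over roughly sqrt(n) steps per level compounds over far fewer decisions.

```python
# Toy model: each planning decision is independently correct with probability p.
p, n = 0.99, 400

flat = p ** n                             # 400 sequential low-level decisions
hierarchical = p ** (2 * int(n ** 0.5))   # ~20 high-level goals + ~20 refinements

print(f"flat: {flat:.3f}  hierarchical: {hierarchical:.3f}")  # ~0.018 vs ~0.669
```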