Unlike coding, where context is centralized (IDE, repo) and output is testable, the context for general knowledge work is scattered across apps. AI struggles to synthesize this fragmented context, and the quality of its output (e.g., a strategy memo) is hard to verify objectively, which limits agent effectiveness.
Current LLMs are intelligent enough for many tasks but fail because they lack access to complete context—emails, Slack messages, past data. The next step is building products that ingest this real-world context, making it available for the model to act upon.
The primary obstacle for tools like OpenAI's Atlas isn't technical capability but the verification burden placed on the user. The time, effort, and security risk involved in checking an AI agent's autonomous actions often exceed the time it would take a human to do the task themselves, limiting practical use cases.
Off-the-shelf AI models can only go so far. The true bottleneck for enterprise adoption is "digitizing judgment"—capturing the unique, context-specific expertise of employees within that company. A document's meaning can change entirely from one company to another, requiring internal labeling.
AI struggles with tasks requiring long, wide-ranging context, like software engineering. Because attention compute grows roughly with the square of the context length, models cannot efficiently manage the complex interdependencies of large projects.
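A rough back-of-the-envelope sketch of that scaling, assuming a transformer-style model with a hypothetical hidden size of 4096 (the figures are illustrative, not measurements of any particular model):

```python
# Approximate self-attention cost: the score matrix (QK^T) and the
# attention-weighted values are each ~2 * seq_len^2 * d_model FLOPs,
# so attention compute grows with the square of the context length.
def attention_flops(seq_len: int, d_model: int = 4096) -> int:
    return 4 * (seq_len ** 2) * d_model

for tokens in (8_000, 16_000, 32_000, 64_000):
    print(f"{tokens:>6} tokens -> {attention_flops(tokens):.2e} FLOPs per layer")
# Every doubling of context costs roughly 4x the attention compute per layer,
# which is why sprawling, interdependent codebases strain current models.
```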
A critical learning at LinkedIn was that pointing an AI at an entire company drive for context results in poor performance and hallucinations. The team had to manually curate "golden examples" and specific knowledge bases to train agents effectively, as the AI couldn't discern quality on its own.
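A minimal sketch of that curation step, assuming a hypothetical per-task store of vetted examples (the task names and fields are invented and do not describe LinkedIn's actual system):

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """A human-vetted input/output pair used to ground the agent's prompt."""
    input_text: str
    approved_output: str

# Hypothetical curated sets: small, reviewed, and scoped per task,
# instead of pointing the model at an entire shared drive.
CURATED_EXAMPLES = {
    "account_summary": [
        GoldenExample(
            input_text="Q3 notes for Acme Corp: renewal call went well, SSO still open.",
            approved_output="Acme Corp: renewal risk low; SSO gap is the main blocker.",
        ),
    ],
}

def build_prompt(task: str, new_input: str) -> str:
    """Assemble a few-shot prompt from vetted examples only; fail loudly if none exist."""
    examples = CURATED_EXAMPLES.get(task)
    if not examples:
        raise ValueError(f"No golden examples curated for task {task!r}")
    shots = "\n\n".join(
        f"Input:\n{ex.input_text}\nApproved output:\n{ex.approved_output}"
        for ex in examples
    )
    return f"{shots}\n\nInput:\n{new_input}\nOutput:"

print(build_prompt("account_summary", "Q4 notes for Acme Corp: SSO shipped, renewal signed."))
```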
OpenAI identifies agent evaluation as a key challenge. While they can currently grade an entire task's trace, the real difficulty lies in evaluating and optimizing the individual steps within a long, complex agentic workflow. This is a work-in-progress area critical for building reliable, production-grade agents.
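One way to picture step-level grading, as a toy sketch with invented step names and string-based checks standing in for a real rubric or judge model (this does not describe OpenAI's actual tooling):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str     # e.g. "search_crm", "draft_email"
    output: str

# Toy per-step checks; a production system would likely score each step
# against a rubric or a judge model rather than string heuristics.
STEP_GRADERS: dict[str, Callable[[str], bool]] = {
    "search_crm": lambda out: "account_id" in out,
    "draft_email": lambda out: len(out) > 50 and out.lower().startswith("dear"),
}

def grade_trace(trace: list[Step]) -> dict[str, bool]:
    """Grade each step independently so failures are localized, not just end-to-end."""
    return {
        step.name: STEP_GRADERS.get(step.name, lambda _: False)(step.output)
        for step in trace
    }

trace = [
    Step("search_crm", "account_id=1042 matched 'Acme Corp'"),
    Step("draft_email", "Dear Ms. Chen, following up on the SSO timeline we discussed..."),
]
print(grade_trace(trace))  # {'search_crm': True, 'draft_email': True}
```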
AI-generated "work slop"—plausible but low-substance content—arises from a lack of specific context. The cure is not just user training but building systems that ingest and index a user's entire work graph, providing the necessary grounding to move from generic drafts to high-signal outputs.
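A toy illustration of that grounding idea: a keyword index over a handful of made-up documents from different apps. A real system would use embeddings and permission-aware connectors; the documents and query here are invented.

```python
from collections import defaultdict

# Fragments of a user's "work graph" from several apps, keyed by source.
documents = {
    "email:1042": "Pricing feedback from Acme: renewal blocked on SSO support.",
    "slack:#sales:88": "Acme asked again about SSO timelines in today's call.",
    "doc:q3-strategy": "Q3 priority: close enterprise gaps, starting with SSO.",
}

# Build a simple inverted index from terms to the documents containing them.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().replace(":", " ").replace(".", " ").replace(",", " ").split():
        index[token].add(doc_id)

def ground(query: str) -> list[str]:
    """Return documents sharing terms with the query, most-overlapping first."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return [documents[d] for d, _ in sorted(hits.items(), key=lambda kv: -kv[1])]

# The retrieved snippets give a draft specific grounding instead of generic filler.
print(ground("why is the acme renewal stuck"))
```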
Research highlights "work slop": AI output that appears polished but lacks human context. This forces coworkers to spend significant time fixing it, effectively offloading cognitive labor and damaging perceptions of the sender's capability and trustworthiness.
While AI models excel at gathering and synthesizing information ('knowing'), they are not yet reliable at executing actions in the real world ('doing'). True agentic systems require bridging this gap by adding crucial layers of validation and human intervention to ensure tasks are performed correctly and safely.
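A minimal sketch of such a validation and human-in-the-loop layer, with invented action kinds and an illustrative approval policy:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    kind: str       # e.g. "send_email", "update_record", "issue_refund"
    payload: dict

# Illustrative policy: only low-risk actions run without a person in the loop.
AUTO_APPROVED = {"update_record"}
KNOWN_KINDS = {"send_email", "update_record", "issue_refund"}

def validate(action: ProposedAction) -> bool:
    """Cheap structural checks before anything touches the real world."""
    return action.kind in KNOWN_KINDS and bool(action.payload)

def execute_with_oversight(action: ProposedAction,
                           approve: Callable[[ProposedAction], bool]) -> str:
    if not validate(action):
        return "rejected: failed validation"
    if action.kind not in AUTO_APPROVED and not approve(action):
        return "held: awaiting human approval"
    return f"executed: {action.kind}"

# A refund waits for a person; a routine record update does not.
print(execute_with_oversight(ProposedAction("issue_refund", {"amount": 120}), approve=lambda a: False))
print(execute_with_oversight(ProposedAction("update_record", {"stage": "closed-won"}), approve=lambda a: False))
```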
Research shows employees are rapidly adopting AI agents. The primary risk isn't a lack of adoption but that these agents are handicapped by fragmented, incomplete, or siloed data. To succeed, companies must first focus on creating structured, centralized knowledge bases for AI to leverage effectively.
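A small sketch of what "structured and centralized" can mean in practice: mapping records from hypothetical silos (CRM rows, support tickets) onto one shared schema before any agent consumes them. The field names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class KnowledgeRecord:
    """One normalized entry in a hypothetical central knowledge base."""
    source: str
    entity: str
    summary: str

# Each silo exports records in its own shape.
crm_rows = [{"account": "Acme Corp", "notes": "Renewal due in Nov; SSO is the blocker."}]
ticket_rows = [{"customer_name": "Acme Corp", "body": "SSO login fails for admin users."}]

def normalize(crm: Iterable[dict], tickets: Iterable[dict]) -> list[KnowledgeRecord]:
    """Map each silo's fields onto one schema so an agent sees a single, consistent view."""
    records = [KnowledgeRecord("crm", r["account"], r["notes"]) for r in crm]
    records += [KnowledgeRecord("support", t["customer_name"], t["body"]) for t in tickets]
    return records

for rec in normalize(crm_rows, ticket_rows):
    print(f"[{rec.source}] {rec.entity}: {rec.summary}")
```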