For high-stakes operations like changing a flight, any AI hallucination is a catastrophic failure that could cost customers or trigger lawsuits. This necessity for 100% accuracy in a complex vertical like travel forced Navan to build its own proprietary, agentic AI platform rather than rely on external models alone.
The inconsistency and 'laziness' of base LLMs pose a major hurdle. The best application-layer companies differentiate themselves not just by wrapping a model, but by building a complex harness that ensures the right amount of intelligence is reliably applied to a specific user task, creating a defensible product.
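A minimal sketch of what such a harness might look like: validate the model's output against a schema, reject low-effort answers, and retry rather than surface a bad completion. The `call_model` client and the confidence threshold are hypothetical, not any particular company's implementation.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical LLM client; swap in your provider's SDK."""
    raise NotImplementedError

def run_with_harness(task_prompt: str, max_retries: int = 3) -> dict:
    """Retry and validate until the model produces a usable answer,
    so an inconsistent or 'lazy' completion never reaches the user."""
    schema_hint = '\nRespond with JSON: {"answer": "...", "confidence": 0.0-1.0}'
    for _ in range(max_retries):
        raw = call_model(task_prompt + schema_hint)
        try:
            result = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: retry rather than surface it
        # Reject low-effort or low-confidence answers instead of passing them on.
        if result.get("answer") and result.get("confidence", 0) >= 0.8:
            return result
    # Exhausted retries: fail loudly rather than guess.
    raise RuntimeError("Model output failed validation; escalate to a human.")
```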
Fully autonomous agents are not yet reliable for complex production use cases because errors compound across chained probabilistic steps: a ten-step chain at 95% per-step accuracy succeeds only about 60% of the time. Zapier's CEO recommends a hybrid "agentic workflow" approach: embed a single, decisive agent within an otherwise deterministic, structured workflow to ensure reliability while still leveraging LLM intelligence.
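A sketch of that pattern, assuming a hypothetical `llm_complete` client and stubbed business actions: the LLM makes exactly one decision, its output is constrained to a closed set, and everything around it is ordinary, testable code.

```python
def llm_complete(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def issue_refund(order_id: str) -> None: ...           # ordinary, testable code
def open_reschedule_ticket(order_id: str) -> None: ...

def handle_request(text: str, order_id: str) -> str:
    """Deterministic workflow with exactly one agentic step."""
    # The single probabilistic step: the LLM classifies the request.
    category = llm_complete(
        "Classify this support request as one of refund/reschedule/other. "
        f"Request: {text}"
    ).strip().lower()
    # Immediately constrain the model's output to a closed set.
    if category not in {"refund", "reschedule", "other"}:
        category = "other"
    # Everything downstream is plain branching -- no chained guesses.
    if category == "refund":
        issue_refund(order_id)
        return "refund_issued"
    if category == "reschedule":
        open_reschedule_ticket(order_id)
        return "ticket_opened"
    return "routed_to_human"
```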
For specialized, high-stakes tasks like insurance underwriting, enterprises will favor smaller, on-prem models fine-tuned on proprietary data. These models can be faster, more accurate, and more secure than general-purpose frontier models, creating a lasting market for custom AI solutions.
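For a sense of what on-prem deployment looks like in practice, here is a minimal sketch using Hugging Face transformers; the checkpoint path and prompt format are illustrative, and the point is simply that the fine-tuned weights and inference never leave local infrastructure.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative local path to a small model fine-tuned on proprietary
# underwriting data; weights and inference stay on-prem.
MODEL_PATH = "/models/underwriting-7b-finetuned"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

def assess_risk(application_summary: str) -> str:
    prompt = f"Assess underwriting risk for:\n{application_summary}\nRisk level:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=50)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```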
Navan's CEO sees the debate over which LLM is best as unimportant because the infrastructure is becoming a commodity. The real value is created in the application layer. Navan's own agentic platform, Cognition, intelligently routes tasks to different models (OpenAI, Anthropic, Google) to get the best result for the job.
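The routing idea reduces to a simple dispatch table. The sketch below is hypothetical (the route table, model names, and `call_llm` client are illustrative, not Cognition's actual internals), but it shows the shape: the application layer owns the mapping from task to model, so the models themselves stay swappable commodities.

```python
# Illustrative route table -- not Navan Cognition's actual internals.
ROUTES = {
    "itinerary_change": "anthropic-model",  # careful multi-step tool use
    "policy_question": "openai-model",      # retrieval-heavy reasoning
    "quick_translation": "google-model",    # fast and cheap
}
DEFAULT_MODEL = "openai-model"

def call_llm(model: str, prompt: str) -> str:
    """Hypothetical unified client over several providers."""
    raise NotImplementedError

def run_task(task_type: str, prompt: str) -> str:
    # Pick the model best suited to this task; fall back to a default.
    model = ROUTES.get(task_type, DEFAULT_MODEL)
    return call_llm(model=model, prompt=prompt)
```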
For applications in banking, insurance, or healthcare, reliability is paramount. Startups that architect their systems from the ground up to prevent hallucinations will have a fundamental advantage over those trying to incrementally reduce errors in general-purpose models.
Anyone can build a simple "hackathon version" of an AI agent. The real, defensible moat comes from the painstaking engineering work to make the agent reliable enough for mission-critical enterprise use cases. This "schlep" of nailing the edge cases is a barrier that many, including big labs, are unmotivated to cross.
Instead of starting with simple generative AI tasks, Airbnb focused on the most difficult application: resolving urgent customer issues like lockouts. This high-stakes approach allowed them to build a robust agent that can now be applied to less critical, "up-funnel" use cases like travel planning.
Relying solely on natural language prompts like 'always do this' is unreliable for enterprise AI, because LLMs struggle with deterministic logic. Salesforce developed 'AgentForce Script,' a dedicated language that enforces rules for consistent, repeatable performance in critical business workflows while still blending in LLM reasoning.
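AgentForce Script itself is not shown here; below is a Python sketch of the underlying pattern it describes: hard business rules enforced in ordinary code that no prompt can talk around, with the LLM confined to the fuzzy judgment inside those bounds. The `llm_complete` client and the refund threshold are hypothetical.

```python
MAX_AUTO_REFUND = 500  # hypothetical business rule, enforced in code

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def process_refund(amount: float, reason: str) -> str:
    # Deterministic guardrail runs first and cannot be talked around:
    # no prompt phrasing can push a large refund through automatically.
    if amount > MAX_AUTO_REFUND:
        return "escalate_to_manager"
    # The LLM handles only the fuzzy judgment within those bounds.
    judgment = llm_complete(
        f"Is this refund reason legitimate? Answer yes or no. Reason: {reason}"
    )
    return "approve" if judgment.strip().lower().startswith("yes") else "deny"
```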
AI agents are simply 'context and actions.' To prevent hallucination and failure, they must be grounded in rich context. This is best provided by a knowledge graph built from the unique data and metadata collected across a platform, creating a powerful, defensible moat.
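A toy illustration of that grounding pattern, assuming a hypothetical `llm_complete` client: facts about an entity are pulled from the graph and the agent is instructed to answer only from them. A production system would use a real graph store, but the context-then-action shape is the same.

```python
# Toy knowledge graph as (subject, relation) -> values entries.
GRAPH = {
    ("listing_42", "checkin_method"): ["smart_lock"],
    ("listing_42", "amenities"): ["wifi", "lockbox"],
    ("listing_42", "support_contact"): ["maria"],
}

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM client."""
    raise NotImplementedError

def gather_context(entity: str) -> str:
    """Pull every known fact about an entity to ground the agent."""
    return "\n".join(
        f"{subject} {relation}: {', '.join(values)}"
        for (subject, relation), values in GRAPH.items()
        if subject == entity
    )

def grounded_answer(question: str, entity: str) -> str:
    # Context first, then action: the agent may only answer from facts.
    prompt = (
        f"Facts:\n{gather_context(entity)}\n\n"
        "Answer using ONLY these facts; reply 'unknown' if they don't cover it.\n"
        f"Question: {question}"
    )
    return llm_complete(prompt)
```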
To prevent AI agents from over-promising or inventing features, you must explicitly define negative constraints. Just as you train them on what your product can do, give them clear boundaries on what it does not do, so they stop inventing answers in an effort to be helpful.
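One simple way to encode this is to put the negative constraints in the system prompt right next to the capabilities. The capability lists below are illustrative, not any real product's feature set.

```python
# Illustrative capability lists for a travel-support agent.
CAPABILITIES = [
    "change or cancel flight bookings",
    "check refund status",
]
NOT_SUPPORTED = [
    "book hotels or rental cars",
    "issue cash refunds outside policy",
    "price-match competitor fares",
]

SYSTEM_PROMPT = (
    "You are a travel support agent.\n"
    "You CAN: " + "; ".join(CAPABILITIES) + ".\n"
    "You CANNOT: " + "; ".join(NOT_SUPPORTED) + ".\n"
    "If asked for anything on the CANNOT list, say so plainly and offer a "
    "supported alternative. Never invent a feature to be helpful."
)
```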