To reliably translate a natural language policy into formal logic, Amazon's system generates multiple translations using an LLM. It then employs a theorem prover to verify these translations are logically equivalent. Mismatches trigger a clarification loop with the user, ensuring the final specification is correct before checking an agent's work.
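
A minimal sketch of the cross-checking step, assuming a hypothetical LLM translation stage and using the open-source Z3 solver to test whether two candidate formalizations are logically equivalent. This illustrates the pattern only; it is not Amazon's actual pipeline.

```python
# Sketch: "translate twice, cross-check with a prover" (pip install z3-solver).
from z3 import Bool, Implies, And, Not, Solver, unsat

# Two candidate translations of "contractors with expired badges may not enter
# the data center", as two LLM samples might produce them (hand-written here).
is_contractor, badge_expired, may_enter = Bool("is_contractor"), Bool("badge_expired"), Bool("may_enter")
translation_a = Implies(And(is_contractor, badge_expired), Not(may_enter))
translation_b = Implies(badge_expired, Not(may_enter))  # dropped the contractor condition

def logically_equivalent(f, g) -> bool:
    """Prover check: f and g are equivalent iff "f differs from g" is unsatisfiable."""
    s = Solver()
    s.add(f != g)
    return s.check() == unsat

if not logically_equivalent(translation_a, translation_b):
    # Mismatch: surface the ambiguity to the user instead of guessing.
    print("Translations disagree -- ask the user: does the rule apply to everyone "
          "with an expired badge, or only to contractors?")
```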

Related Insights

Formalizing policies does not mean creating rigid systems; it makes the rules transparent and debatable. It also allows explicit exceptions to be built in: the final "axiom" in a logical system can simply be "go talk to a human." This preserves necessary flexibility and discretion while keeping the process auditable and clear.
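
One way to picture the "final axiom escalates to a human" pattern, as a short sketch with invented rule names rather than any particular policy engine:

```python
# Ordered rules; the last one always decides, so every request gets an auditable answer.
from typing import Callable, Optional

Rule = Callable[[dict], Optional[str]]  # returns a decision, or None to fall through

def deny_over_limit(req: dict) -> Optional[str]:
    return "deny" if req.get("refund_amount", 0) > 10_000 else None

def auto_approve_small(req: dict) -> Optional[str]:
    return "approve" if req.get("refund_amount", 0) < 100 else None

def escalate_to_human(req: dict) -> Optional[str]:
    return "escalate: route to support lead"   # the explicit final "axiom"

POLICY: list[Rule] = [deny_over_limit, auto_approve_small, escalate_to_human]

def decide(req: dict) -> str:
    for rule in POLICY:
        decision = rule(req)
        if decision is not None:
            return decision
    raise AssertionError("unreachable: escalate_to_human always decides")

print(decide({"refund_amount": 550}))  # -> "escalate: route to support lead"
```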

Rather than relying on a single LLM, LexisNexis employs a "planning agent" that decomposes a complex legal query into sub-tasks. It then assigns each task (e.g., deep research, document drafting) to the specific LLM best suited for it, demonstrating a sophisticated, model-agnostic approach to enterprise AI.
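
A hedged sketch of the plan-then-route pattern, with a stubbed planner and an invented task-to-model registry; this is the shape of the idea, not LexisNexis's implementation.

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    kind: str          # e.g. "deep_research", "drafting", "citation_check"
    instruction: str

# Hypothetical registry mapping each task type to the model judged best for it.
MODEL_FOR_TASK = {
    "deep_research": "research-tuned-model",
    "drafting": "long-context-drafting-model",
    "citation_check": "small-precise-model",
}

def plan(query: str) -> list[SubTask]:
    """Stand-in for the planning agent: decompose the query into sub-tasks."""
    return [
        SubTask("deep_research", f"Find controlling authority for: {query}"),
        SubTask("drafting", f"Draft a memo answering: {query}"),
        SubTask("citation_check", "Verify every citation in the draft"),
    ]

def run(query: str) -> None:
    for task in plan(query):
        model = MODEL_FOR_TASK[task.kind]
        # call_model(model, task.instruction) would go here
        print(f"[{model}] <- {task.instruction}")

run("Is a non-compete enforceable against a contractor in California?")
```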

While AI can generate code, the stakes on blockchain are too high for bugs, which translate directly into financial loss. The solution is formal verification: using mathematical proofs to guarantee smart contract correctness. This provides a safety net, enabling users and AI to confidently build and interact with financial applications.
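
A toy flavor of the idea using the Z3 solver: instead of testing a handful of inputs, prove that a simplified transfer preserves total balance for all inputs. Real smart-contract verification relies on dedicated tooling, but the shape of the guarantee is the same.

```python
from z3 import Ints, Implies, And, Not, Solver, unsat

sender, receiver, amount = Ints("sender receiver amount")
precondition = And(amount >= 0, sender >= amount)   # the check the contract performs
claim = Implies(precondition,
                (sender - amount) + (receiver + amount) == sender + receiver)

s = Solver()
s.add(Not(claim))                  # search for any counterexample
assert s.check() == unsat          # none exists: the property holds for every input
print("transfer preserves total balance for all senders, receivers, and amounts")
```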

Implement human-in-the-loop checkpoints using a simple, fast LLM as a 'generative filter.' This agent's sole job is to interpret natural language feedback from a human reviewer (e.g., in Slack) and translate it into a structured command ('ship it' or 'revise') to trigger the correct automated pathway.
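
A minimal sketch of such a generative filter, where call_small_llm is a placeholder for whatever fast, cheap model sits behind it and the Slack wiring is omitted:

```python
import json

def call_small_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder: swap in your provider's fast model

FILTER_INSTRUCTIONS = (
    "You are a routing filter. Read the reviewer's message and reply with JSON only, "
    'either {"command": "ship_it"} or {"command": "revise", "notes": "<short summary>"}.'
)

def route_feedback(reviewer_message: str) -> dict:
    raw = call_small_llm(f"{FILTER_INSTRUCTIONS}\n\nReviewer message:\n{reviewer_message}")
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return {"command": "revise", "notes": "filter output was not valid JSON"}
    if decision.get("command") not in {"ship_it", "revise"}:
        return {"command": "revise", "notes": "filter returned an unknown command"}
    return decision  # downstream automation branches on decision["command"]
```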

Treating AI evaluation like a final exam is a mistake. For critical enterprise systems, evaluations should be embedded at every step of an agent's workflow (e.g., after planning, before action). This is akin to unit testing in classic software development and is essential for building trustworthy, production-ready agents.
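
A sketch of what step-level evals look like in code, with a stubbed planner, policy, and executor; the two guarded checkpoints (after planning, before action) are the point.

```python
def make_plan(task: str) -> list[str]:
    return [f"search background on {task}", f"draft an answer to {task}"]  # stub planner

def choose_action(step: str) -> dict:
    return {"tool": step.split()[0], "input": step}                         # stub policy

def execute(action: dict) -> None:
    print("executing", action)                                              # stub executor

def eval_plan(plan: list[str]) -> bool:
    """Checkpoint 1: judge the plan before anything runs (rules or LLM-as-judge)."""
    return 0 < len(plan) <= 5 and all(step.strip() for step in plan)

def eval_action(action: dict) -> bool:
    """Checkpoint 2: only whitelisted tools may run."""
    return action.get("tool") in {"search", "draft", "summarize"}

def run_agent(task: str) -> None:
    plan = make_plan(task)
    assert eval_plan(plan), "plan failed its eval; stop before acting"
    for step in plan:
        action = choose_action(step)
        assert eval_action(action), f"blocked action for step: {step}"
        execute(action)

run_agent("quarterly churn report")
```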

To ensure reliability in healthcare, ZocDoc doesn't give LLMs free rein. It wraps them in a hybrid system where traditional, deterministic code orchestrates the AI's tasks, sets firm boundaries, and knows when to hand off to a human, preventing the 'praying for the best' approach common with direct LLM use.
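
A compressed sketch of the hybrid pattern, not ZocDoc's actual code: deterministic Python owns the control flow, and the LLM is consulted only inside a boundary it cannot cross.

```python
ALLOWED_INTENTS = {"book_appointment", "reschedule", "cancel"}

def classify_intent(message: str) -> str:
    raise NotImplementedError  # the only place an LLM is consulted

def handle(message: str) -> str:
    try:
        intent = classify_intent(message)
    except Exception:
        return "handoff_to_human: model unavailable or errored"
    if intent not in ALLOWED_INTENTS:     # firm boundary enforced by code, not the model
        return "handoff_to_human: request outside supported workflows"
    return f"run deterministic {intent} workflow"
```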

While the computational problem of finding a proof is intractable, the real-world bottleneck is the human process of defining the specification. Getting stakeholders to agree on what a property like "all data at rest is encrypted" truly means requires intense negotiation and is by far the most difficult part.
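
A small illustration, with made-up fields, of why that negotiation is hard: two defensible formalizations of the same sentence that disagree on a concrete system.

```python
from dataclasses import dataclass

@dataclass
class DataStore:
    name: str
    disk_encrypted: bool      # full-disk / volume encryption
    field_encrypted: bool     # application-level encryption of sensitive fields
    is_cache: bool

def spec_v1(stores):  # "every store sits on an encrypted volume"
    return all(s.disk_encrypted for s in stores)

def spec_v2(stores):  # "sensitive fields are themselves encrypted, caches included"
    return all(s.field_encrypted for s in stores)

session_cache = DataStore("session-cache", disk_encrypted=True, field_encrypted=False, is_cache=True)
print(spec_v1([session_cache]), spec_v2([session_cache]))  # True False: stakeholders must pick one
```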

To improve the quality and accuracy of an AI agent's output, spawn multiple sub-agents with competing or adversarial roles. For example, a code review agent finds bugs, while several "auditor" agents check for false positives, resulting in a more reliable final analysis.
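
A sketch of the adversarial-review pattern with stubbed agents standing in for real model calls; a majority vote over auditor verdicts is what filters out false positives.

```python
def reviewer_agent(code: str) -> list[str]:
    """Finds candidate bugs (stubbed; would be an LLM review pass)."""
    return ["possible off-by-one in pagination", "unused variable `tmp`"]

def auditor_agent(code: str, finding: str, persona: str) -> bool:
    """Returns True if this auditor believes the finding is real (stubbed skeptic)."""
    return "off-by-one" in finding

def review(code: str, n_auditors: int = 3) -> list[str]:
    confirmed = []
    for finding in reviewer_agent(code):
        votes = sum(auditor_agent(code, finding, f"auditor-{i}") for i in range(n_auditors))
        if votes > n_auditors // 2:        # keep only findings a majority of auditors confirm
            confirmed.append(finding)
    return confirmed

print(review("def page(items, n): return items[0:n+1]"))
```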

The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents: they are derived from real user data and constantly, automatically test whether the product meets its requirements.
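
An illustrative judge prompt written as a requirements doc; the rubric is invented and the model call is a placeholder, but note that it runs against real transcripts rather than hand-written test cases.

```python
JUDGE_PROMPT = """You are grading a support-agent reply against our product requirements:
1. Answers the user's question directly in the first sentence.
2. Never promises a refund without an order ID.
3. Apologizes at most once.

Reply with JSON: {"pass": true or false, "violated_requirements": [...]}

User message: {user}
Agent reply: {reply}"""

def call_judge_llm(prompt: str) -> str:
    raise NotImplementedError  # swap in a real model client

def grade(transcripts: list[dict]) -> None:
    for t in transcripts:   # transcripts sampled from real user traffic
        prompt = JUDGE_PROMPT.replace("{user}", t["user"]).replace("{reply}", t["reply"])
        print(call_judge_llm(prompt))
```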

To prevent AI coding assistants from hallucinating, developer Terry Lynn uses a two-step process. First, an AI generates a Product Requirements Document (PRD). Then a separate AI "reviewer" rates the PRD's clarity out of 10 and identifies gaps before any code is written, which yields a higher rate of successful execution.
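
A sketch of that loop with placeholder model calls; the prompts and the 8-out-of-10 threshold are illustrative, not the developer's exact setup.

```python
def generate_prd(feature_request: str) -> str:
    raise NotImplementedError  # LLM call: "Write a PRD for: ..."

def review_prd(prd: str) -> dict:
    raise NotImplementedError  # LLM call: "Rate clarity 1-10 and list gaps as JSON"

def prd_then_code(feature_request: str, min_score: int = 8, max_rounds: int = 3) -> str:
    prd = generate_prd(feature_request)
    for _ in range(max_rounds):
        verdict = review_prd(prd)                 # e.g. {"score": 6, "gaps": [...]}
        if verdict["score"] >= min_score:
            return prd                            # only now hand the PRD to the coding agent
        prd = generate_prd(f"{feature_request}\nAddress these gaps: {verdict['gaps']}")
    raise RuntimeError("PRD never reached the clarity bar; escalate to a human")
```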