Instead of supervising an AI's hidden thought process, we can require it to produce a 'certificate of reasoning' (a checkable proof) along with its output. Such a certificate could include citations or sensitivity analyses, shifting verification from observing the process to checking the provided proof.
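One way to picture a checkable certificate is a claim bundled with verbatim quotes that a simple program can look up in the cited sources. The sketch below is only an illustration under that assumption: `Citation`, `Certificate`, and `check_certificate` are hypothetical names, and the check confirms only that each quote exists in its source, not that the quote actually supports the claim.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_id: str  # which document the quote is claimed to come from
    quote: str      # verbatim text the claim rests on


@dataclass(frozen=True)
class Certificate:
    claim: str
    citations: tuple[Citation, ...]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not fail the check."""
    return " ".join(text.lower().split())


def check_certificate(cert: Certificate, sources: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every citation was found."""
    problems = []
    if not cert.citations:
        problems.append("claim has no supporting citations")
    for c in cert.citations:
        doc = sources.get(c.source_id)
        if doc is None:
            problems.append(f"unknown source: {c.source_id}")
        elif normalize(c.quote) not in normalize(doc):
            problems.append(f"quote not found in {c.source_id}")
    return problems


# Toy data, invented for illustration.
sources = {"paper_a": "We enrolled 120 patients.  The treatment reduced symptoms by 30%."}
good = Certificate(
    "The treatment reduced symptoms.",
    (Citation("paper_a", "the treatment reduced symptoms by 30%"),),
)
bad = Certificate(
    "The treatment reduced symptoms dramatically.",
    (Citation("paper_a", "reduced symptoms by 80%"),),
)

print(check_certificate(good, sources))  # []
print(check_certificate(bad, sources))   # ['quote not found in paper_a']
```

The verifier never needs to see how the claim was produced; it only needs the certificate and the sources.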
Elicit built a Domain-Specific Language (DSL) defining reasoning primitives as microservices. Frontier models orchestrate these primitives to create structured workflows, ensuring complex processes run exactly as defined and overcoming the inherent unreliability of standard LLMs for high-stakes tasks.
For users in life sciences, an AI tool's value lies not just in its power but in its ability to apply the exact same reasoning process consistently across thousands of data points. Elicit guarantees the 9,999th item is analyzed identically to the 5th, providing trust at scale.
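The idea of fixed primitives composed into a declared workflow, then applied identically to every item, can be sketched in a few lines. This is not Elicit's DSL: the primitive names, the `WORKFLOW` tuple, and the regex-based extraction are all hypothetical stand-ins for what would in practice be model-backed services.

```python
import re


def extract_sample_size(state: dict) -> dict:
    # Toy stand-in for a model-backed extraction primitive.
    match = re.search(r"n\s*=\s*(\d+)", state["text"])
    return {**state, "n": int(match.group(1)) if match else None}


def flag_small_sample(state: dict) -> dict:
    # Hypothetical rule: fewer than 30 participants counts as a small sample.
    n = state["n"]
    return {**state, "small_sample": n is not None and n < 30}


# Registry of available primitives (hypothetical names).
PRIMITIVES = {
    "extract_sample_size": extract_sample_size,
    "flag_small_sample": flag_small_sample,
}

# The workflow is plain data, so the same steps run in the same order every time.
WORKFLOW = ("extract_sample_size", "flag_small_sample")


def run(workflow: tuple, text: str) -> tuple:
    """Apply each named primitive in order and record which steps ran."""
    state = {"text": text}
    trace = []
    for step in workflow:
        state = PRIMITIVES[step](state)
        trace.append(step)
    return state, trace


items = ["RCT with n = 24 adults", "Cohort study, n=1200"]
results = [run(WORKFLOW, item) for item in items]
for state, trace in results:
    print(state["n"], state["small_sample"], trace)
```

Because the step sequence is declared once rather than improvised per item, the recorded trace is the same for the first item and the last.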
Humans rely on lossy proxies like journal prestige and citation counts to judge research. AI enables a shift to evaluating the work's content directly—methodology, sample size, and logical coherence—for a more accurate assessment of evidence quality tailored to a specific question.
As AI masters content generation, it will handle the "blank page" problem. The crucial human task will then shift from creation to evaluation: defining what 'good' looks like, identifying AI failure modes, and building better verification systems to ensure outputs are trustworthy and useful.
To manage costs, the optimal architecture isn't running everything on the most powerful model. Instead, a smart orchestrator agent should break down complex problems and dispatch simpler sub-tasks to smaller, cheaper models, optimizing for both cost and performance.
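A minimal sketch of that routing logic follows. The tier names, relative costs, and difficulty scores are invented for illustration, and the sketch assumes the orchestrator can score each sub-task's difficulty, which is itself a hard problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    cost_per_call: float  # hypothetical relative cost units
    max_difficulty: int   # hardest sub-task this tier is trusted with


# Hypothetical tiers, ordered cheapest first.
TIERS = [
    ModelTier("small", 1.0, 2),
    ModelTier("medium", 5.0, 5),
    ModelTier("large", 25.0, 10),
]


def route(difficulty: int) -> ModelTier:
    """Pick the cheapest tier rated for this difficulty; fall back to the largest."""
    for tier in TIERS:
        if difficulty <= tier.max_difficulty:
            return tier
    return TIERS[-1]


def plan(subtasks: dict) -> tuple:
    """Map each sub-task to a tier name and total the estimated cost."""
    assignment = {name: route(d).name for name, d in subtasks.items()}
    total = sum(route(d).cost_per_call for d in subtasks.values())
    return assignment, total


# Difficulty scores here are made up; a real orchestrator would estimate them.
subtasks = {
    "extract sample size": 1,
    "classify study design": 4,
    "synthesize findings": 9,
}
assignment, total = plan(subtasks)
all_large = len(subtasks) * TIERS[-1].cost_per_call

print(assignment)
print(total, "vs", all_large)  # 31.0 vs 75.0
```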
The benefit of discrete reasoning (like generating tokens or tool calls) over a continuous 'neuralese' is error correction, analogous to why digital computing beat analog. A slightly wrong token can be 'rounded' to the correct one, preventing the compounding errors that would plague a purely continuous process.
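The digital-versus-analog intuition can be shown with a toy simulation: a value is passed through many noisy stages, once as a raw continuous signal and once snapped to the nearest integer after every stage. This is only a loose analogy to tokens; the noise level and step count are arbitrary assumptions, and the snapping works here only because the per-step noise stays below half the spacing between symbols.

```python
import random


def relay(value: float, steps: int, noise: float, snap: bool, rng: random.Random) -> float:
    """Pass a value through `steps` noisy stages, optionally rounding after each one."""
    for _ in range(steps):
        value += rng.uniform(-noise, noise)
        if snap:
            # Discrete case: each stage restores the nearest valid symbol.
            value = float(round(value))
    return value


rng = random.Random(0)
analog = relay(3.0, steps=1000, noise=0.2, snap=False, rng=rng)
discrete = relay(3.0, steps=1000, noise=0.2, snap=True, rng=rng)

print(f"analog:   {analog:.3f}")    # drifts as the small errors accumulate
print(f"discrete: {discrete:.3f}")  # 3.000, since each error is rounded away
```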
Unlike a human expert's judgments, an LLM's probability estimates and conclusions can be drastically altered by simple rephrasing or irrelevant suggestions. This instability shows that LLMs are too easily "pushed around" and lack the coherent world model necessary for trustworthy, high-stakes decision support.
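This kind of instability can be measured by asking the same question in several phrasings and looking at how far the answers spread. The sketch below assumes a callable that maps a prompt to a probability; `toy_model` is a deliberately suggestible stand-in, not a real LLM, and its numbers are invented.

```python
from statistics import mean, pstdev
from typing import Callable


def paraphrase_spread(estimate: Callable[[str], float], prompts: list) -> dict:
    """Query the same question in several phrasings and summarize the disagreement."""
    values = [estimate(p) for p in prompts]
    return {
        "mean": mean(values),
        "range": max(values) - min(values),
        "stdev": pstdev(values),
    }


def toy_model(prompt: str) -> float:
    # Stand-in for an LLM call: it sways toward whatever the prompt suggests.
    p = 0.30
    if "surely" in prompt.lower():
        p += 0.40
    if "unlikely" in prompt.lower():
        p -= 0.20
    return p


prompts = [
    "What is the probability that the trial replicates?",
    "Surely the trial replicates, right? Give a probability.",
    "It seems unlikely, but what is the probability that the trial replicates?",
]
spread = paraphrase_spread(toy_model, prompts)
print({k: round(v, 3) for k, v in spread.items()})
```

A stable estimator would show a range near zero across phrasings that ask the same thing.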
Continual learning is more reliably achieved by having AI build explicit, external 'world models' such as knowledge graphs than by relying on opaque model weights. This approach makes the model's understanding inspectable and correctable by humans, enabling more robust causal analysis.
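As a rough illustration of what "inspectable and correctable" could mean, the sketch below stores facts as labeled edges with a recorded source, lets a human retract a mistaken edge, and follows causal links transitively. The `WorldModel` class and the example facts are hypothetical and far simpler than a production knowledge graph.

```python
from collections import defaultdict


class WorldModel:
    """A tiny explicit fact store: subject -> {(relation, object): source}."""

    def __init__(self):
        self._edges = defaultdict(dict)

    def assert_fact(self, subj, rel, obj, source):
        # Every fact carries the source that justified it, so it can be audited.
        self._edges[subj][(rel, obj)] = source

    def retract(self, subj, rel, obj):
        """Remove a fact; returns True if it was present."""
        return self._edges.get(subj, {}).pop((rel, obj), None) is not None

    def facts_about(self, subj):
        """List (relation, object, source) triples for human inspection."""
        return sorted((r, o, s) for (r, o), s in self._edges.get(subj, {}).items())

    def downstream(self, subj, rel):
        """Everything reachable from `subj` by following `rel` edges transitively."""
        seen, stack = set(), [subj]
        while stack:
            node = stack.pop()
            for (r, obj) in self._edges.get(node, {}):
                if r == rel and obj not in seen:
                    seen.add(obj)
                    stack.append(obj)
        return seen


wm = WorldModel()
wm.assert_fact("rain", "causes", "wet roads", source="obs_1")
wm.assert_fact("wet roads", "causes", "longer braking distance", source="obs_2")
# A mistaken edge: correlation recorded as causation.
wm.assert_fact("umbrella sales", "causes", "rain", source="obs_3")

before = wm.downstream("umbrella sales", "causes")
removed = wm.retract("umbrella sales", "causes", "rain")  # human correction
after = wm.downstream("umbrella sales", "causes")

print(sorted(before))
print(removed, sorted(after))
```

Correcting one edge immediately changes every conclusion that depended on it, which is hard to guarantee when the same belief lives only in model weights.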
If the AI community prioritizes truth-seeking over persuasive-sounding outputs, it could create a virtuous cycle. A more truth-seeking AI would better identify the most important interventions to improve its own reasoning, leading to a feedback loop that rapidly enhances epistemic quality.
Elicit's system, 'The Line,' automates the full software development lifecycle. It takes feature requests initiated by a Slack emoji, then handles speccing, implementation, video-based testing, code review, and merging to production, calling for human intervention only when necessary.
When asked to analyze 100 papers, LLMs often admit they didn't complete the task. This failure stems from outcome-based training, which prioritizes a plausible-looking final output over correctly following the required process, revealing a fundamental flaw in current training paradigms.
