Brex initially invested in a sophisticated reinforcement learning model for credit underwriting but found it was inferior to a straightforward web research agent. For operational tasks requiring auditable processes, simpler LLM applications are often superior.

Related Insights

By providing a model with a few core tools (context management, web search, code execution), Artificial Analysis found the model performed better on complex tasks than the integrated agentic systems inside major web chatbots. This suggests leaner, focused toolsets can be more effective.
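A minimal sketch of such a lean harness. The `call_model` function and the tool implementations are hypothetical placeholders standing in for a real chat-completions client, not Artificial Analysis's actual code:

```python
# Minimal lean-toolset agent loop: just context management, web search,
# and code execution. All three tools and `call_model` are placeholders.

def web_search(query: str) -> str:
    return f"results for: {query}"          # real impl: call a search API

def run_code(source: str) -> str:
    return "stdout: ..."                    # real impl: sandboxed execution

def compact_context(messages: list) -> list:
    # Crude context management: keep the original task plus recent turns.
    return messages[:1] + messages[-6:] if len(messages) > 8 else messages

TOOLS = {"web_search": web_search, "run_code": run_code}

def call_model(messages: list) -> dict:
    # Placeholder: swap in any chat client that returns either
    # {"tool": name, "args": ...} or {"tool": None, "answer": ...}.
    return {"tool": None, "answer": "done"}

def agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        messages = compact_context(messages)
        reply = call_model(messages)
        if reply["tool"] is None:           # model answered directly
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply.get("args", ""))
        messages.append({"role": "tool", "content": result})
    return "step budget exhausted"
```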

Many AI developers get distracted by LLM hype, constantly chasing the best-performing model. The real focus should be on solving a specific customer problem: the LLM is a component, not the product, and deterministic code or simpler tools often handle certain tasks better.

Brex's automated expense auditing employs a multi-agent system. An "audit agent" is optimized for recall, flagging every potential policy violation. A second "review agent" then applies judgment and business context to decide which cases are significant enough to pursue.
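A sketch of this recall-then-precision pattern. The toy rule logic below stands in for the two LLM-backed agents, whose internals the source does not specify:

```python
# Two-stage audit: stage one maximizes recall, stage two applies judgment.
from dataclasses import dataclass

@dataclass
class Flag:
    expense_id: str
    rule: str
    amount: float

def audit_agent(expense: dict, policy_limit: float) -> list[Flag]:
    # Recall-optimized pass (stand-in for an LLM call): flag anything
    # that might violate policy, erring on the side of flagging.
    flags = []
    if expense["amount"] > policy_limit:
        flags.append(Flag(expense["id"], "over_limit", expense["amount"]))
    return flags

def review_agent(flag: Flag, min_material_amount: float = 50.0) -> bool:
    # Judgment pass (stand-in for a second LLM call with business context):
    # only pursue flags that are material enough to be worth acting on.
    return flag.amount >= min_material_amount

def audit_pipeline(expenses: list[dict], policy_limit: float) -> list[Flag]:
    flags = [f for e in expenses for f in audit_agent(e, policy_limit)]
    return [f for f in flags if review_agent(f)]
```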

For its user assistant, Brex moved beyond a single agent with many tools. Instead, they built a network where specialized sub-agents (e.g., policy, travel) have multi-turn conversations with an orchestrator agent to collaboratively solve complex user requests.
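A sketch of the orchestrator/sub-agent pattern with stubbed model calls; the class names, specialists, and routing logic are illustrative assumptions, not Brex's implementation:

```python
# Orchestrator/sub-agent network: the orchestrator holds multi-turn
# conversations with specialists instead of invoking them as one-shot tools.

class SubAgent:
    def __init__(self, name: str, system_prompt: str):
        self.name = name
        self.history = [{"role": "system", "content": system_prompt}]

    def converse(self, message: str) -> str:
        # Each specialist keeps its own conversation state across turns.
        self.history.append({"role": "orchestrator", "content": message})
        reply = f"[{self.name}] reply to: {message}"  # stand-in for a model call
        self.history.append({"role": "assistant", "content": reply})
        return reply

class Orchestrator:
    def __init__(self):
        self.specialists = {
            "policy": SubAgent("policy", "You answer expense-policy questions."),
            "travel": SubAgent("travel", "You handle travel booking and rules."),
        }

    def solve(self, request: str) -> str:
        # Multi-turn collaboration: consult one specialist, feed its answer
        # into a follow-up exchange with another.
        policy_view = self.specialists["policy"].converse(request)
        return self.specialists["travel"].converse(
            f"Given this policy constraint: {policy_view}\nHandle: {request}"
        )
```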

Rather than relying on a single LLM, LexisNexis employs a "planning agent" that decomposes a complex legal query into sub-tasks. It then assigns each task (e.g., deep research, document drafting) to whichever LLM is best suited for it, a sophisticated, model-agnostic approach to enterprise AI.
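A sketch of this plan-then-route pattern; the model names and routing table are illustrative assumptions, not LexisNexis's actual stack:

```python
# Plan-then-route: a planner decomposes the query into typed sub-tasks,
# and each sub-task is dispatched to the model best suited for it.

MODEL_FOR_TASK = {
    "deep_research": "research-tuned-model",
    "drafting": "long-context-drafting-model",
    "citation_check": "small-fast-model",
}

def plan(query: str) -> list[dict]:
    # Stand-in for the planning agent's LLM call.
    return [
        {"type": "deep_research", "input": query},
        {"type": "drafting", "input": "memo from research findings"},
        {"type": "citation_check", "input": "draft memo"},
    ]

def run_task(task: dict) -> str:
    model = MODEL_FOR_TASK[task["type"]]
    return f"{model} -> output for {task['type']}"  # stand-in for a model call

def answer(query: str) -> str:
    results = [run_task(t) for t in plan(query)]
    return results[-1]
```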

Modern LLMs are trained with a simple form of reinforcement learning that directly rewards successful outcomes. This contrasts with more sophisticated methods, like those in AlphaGo or the brain, which use "value functions" to estimate long-term consequences. Why the simpler approach is so effective remains an open question.
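In standard notation (an assumption; the source gives no formulas), the contrast is between a REINFORCE-style gradient that weights an entire trajectory by its final outcome, and value-based methods that learn V(s) and bootstrap on the temporal-difference error:

```latex
% Outcome-reward policy gradient: the whole trajectory is credited with R(tau).
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\, R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]

% Value-function methods instead estimate long-term consequences per step:
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),
\qquad V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t
```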

Training AI agents to execute multi-step business workflows demands a new data paradigm. Companies create reinforcement learning (RL) environments—mini world models of business processes—where agents learn by attempting tasks, a more advanced method than simple prompt-completion training (SFT/RLHF).
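A sketch of such an environment in the familiar gym-style reset/step interface; the invoice-approval workflow and its reward are illustrative assumptions:

```python
# A mini "world model" of a business process as an RL environment.

class InvoiceApprovalEnv:
    ACTIONS = ["open_invoice", "match_po", "request_approval", "pay"]

    def reset(self) -> dict:
        self.state = {"opened": False, "matched": False, "approved": False}
        return dict(self.state)

    def step(self, action: str):
        # Valid transitions mirror the real workflow's ordering constraints.
        if action == "open_invoice":
            self.state["opened"] = True
        elif action == "match_po" and self.state["opened"]:
            self.state["matched"] = True
        elif action == "request_approval" and self.state["matched"]:
            self.state["approved"] = True
        done = action == "pay"
        # Sparse outcome reward: success only after the full, correct sequence.
        reward = 1.0 if done and self.state["approved"] else 0.0
        return dict(self.state), reward, done
```

An agent is trained by rolling out action sequences against this environment and reinforcing the ones that reach the terminal reward, rather than by imitating prompt-completion pairs.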

Instead of relying on expensive, general-purpose frontier models, companies can get better performance at lower cost. By creating a Reinforcement Learning (RL) environment specific to their application (e.g., a code editor), they can train smaller, specialized open-source models to excel at that task for a fraction of the cost.
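The core of such an environment is a verifiable reward signal. A minimal sketch for a code-editor setting, assuming the project's test suite is the verifier (paths and commands are illustrative, and a real setup would sandbox execution):

```python
# Reward for a code-editor RL environment: apply the model's edit,
# run the tests, and reward only verified success.
import pathlib
import subprocess

def edit_reward(file_path: str, edited_source: str, test_cmd: list[str]) -> float:
    pathlib.Path(file_path).write_text(edited_source)
    try:
        result = subprocess.run(test_cmd, capture_output=True, timeout=120)
    except subprocess.TimeoutExpired:
        return 0.0                      # hung edits earn nothing
    return 1.0 if result.returncode == 0 else 0.0
```

A small open-source model fine-tuned against this signal can specialize in the one workflow that matters to the product, at a fraction of frontier-model cost.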

When testing models on the GDPVal benchmark, Artificial Analysis's simple agent harness allowed models like Claude to outperform their official web chatbot counterparts. This implies that bespoke chatbot environments are often constrained for cost or safety reasons, limiting the full agentic capabilities that developers can unlock with custom tooling.

An open-source harness with just basic tools like web search and a code interpreter enabled models to score higher on the GDPVal benchmark than when using their own integrated chatbot interfaces. This implies that for highly capable models, a less restrictive framework allows for better performance.