An open-source harness with just basic tools like web search and a code interpreter enabled models to score higher on the GDPVal benchmark than when using their own integrated chatbot interfaces. This implies that for highly capable models, a less restrictive framework allows for better performance.
By providing a model with just a few core tools (context management, web search, code execution), Artificial Analysis found that models performed better on complex tasks than the integrated agentic systems built into major web chatbots. This suggests leaner, more focused toolsets can be more effective.
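A minimal sketch of what such a lean harness could look like, assuming a placeholder `call_model` function and a simple dict-based tool protocol; the loop shape and tool names are illustrative, not Artificial Analysis's actual implementation (context management is omitted for brevity):

```python
import subprocess


def web_search(query: str) -> str:
    """Placeholder: wire up whatever search provider is available."""
    raise NotImplementedError


def run_python(code: str) -> str:
    """Execute model-written Python in a subprocess and return its output."""
    result = subprocess.run(
        ["python", "-c", code], capture_output=True, text=True, timeout=120
    )
    return result.stdout + result.stderr


TOOLS = {"web_search": web_search, "run_python": run_python}


def agent_loop(task: str, call_model, max_steps: int = 20) -> str:
    """call_model(messages) -> dict with either a tool request or a final answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)      # provider-specific call, assumed here
        if reply.get("final"):            # model signals it is done
            return reply["content"]
        observation = TOOLS[reply["tool"]](**reply["arguments"])
        messages.append({"role": "tool", "content": observation})
    return "step budget exhausted"
```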
A practical hack to improve AI agent reliability is to avoid built-in tool-calling functions. LLMs have more training data on writing code than on specific tool-use APIs. Prompting the agent to write and execute the code that calls a tool leverages its core strength and produces better outcomes.
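A sketch of how that might look in practice; the prompt wording and the `call_model` / `run_sandboxed` helpers are assumptions, not a specific library's API. Instead of registering a tool schema, the available functions are described in the prompt and the model's code is executed directly:

```python
TOOL_PROMPT = """You have access to these Python functions:
  search(query: str) -> str      # web search
  read_file(path: str) -> str    # read a local file

Reply with a single Python code block that calls these functions
and prints the final answer. Do not reply with prose."""


def solve(task: str, call_model, run_sandboxed) -> str:
    # Ask for runnable code rather than a structured tool call.
    code = call_model(f"{TOOL_PROMPT}\n\nTask: {task}")
    # The sandbox already has search() and read_file() defined; it runs the
    # model-written code and captures whatever it prints.
    return run_sandboxed(code)
```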
AI platforms using the same base model (e.g., Claude) can produce vastly different results. The key differentiator is the proprietary 'agent' layer built on top, which gives the model specific tools to interact with code (read, write, edit files). A superior agent leads to superior performance.
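As a rough illustration of that agent layer, the file-interaction tools might look something like the sketch below; the exact schemas and names vary by product and are assumptions here:

```python
from pathlib import Path


def read_file(path: str) -> str:
    return Path(path).read_text()


def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"


def edit_file(path: str, old: str, new: str) -> str:
    """Replace an exact snippet once; fail loudly so the model can retry."""
    text = Path(path).read_text()
    if old not in text:
        return f"error: snippet not found in {path}"
    Path(path).write_text(text.replace(old, new, 1))
    return f"edited {path}"
```

Much of the differentiation lies in details like these: how edits are specified, how errors are reported back to the model, and what guardrails surround each call.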
Early on, Google's Jules team built complex scaffolding with numerous sub-agents to compensate for model weaknesses. As models like Gemini improved, the team found that simpler architectures performed better and were easier to maintain. The complex scaffolding was a temporary crutch, not a sustainable long-term solution.
Instead of giving an LLM hundreds of specific tools, a more scalable "cyborg" approach is to provide one tool: a sandboxed code execution environment. The LLM writes code against a company's SDK, which is more context-efficient, faster, and more flexible than multiple API round-trips.
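A sketch of this pattern, assuming a hypothetical `acme_sdk` package and a `run_in_sandbox` executor; neither is a real API:

```python
EXECUTE_CODE_TOOL = {
    "name": "execute_code",
    "description": (
        "Run Python in a sandbox with `acme_sdk` preloaded. Chain as many "
        "SDK calls as needed in one script and print the result."
    ),
    "parameters": {"code": {"type": "string"}},
}


def execute_code(code: str, run_in_sandbox) -> str:
    # One round-trip: the model batches lookups, filtering, and aggregation
    # in a single script instead of dozens of separate tool calls.
    return run_in_sandbox(code, preload=["acme_sdk"])


# What the model might write for "total refunds for customer 42 this month":
#   import acme_sdk
#   orders = acme_sdk.orders.list(customer_id=42, month="2025-01")
#   print(sum(o.refund_amount for o in orders))
```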
Contrary to the trend toward multi-agent systems, Tasklet finds that one powerful agent with access to all context and tools is superior for a single user's goals. Splitting tasks among specialized agents is less effective than giving one generalist agent all the information, since frontier foundation models are already broadly capable across domains.
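A schematic contrast of the two designs; the `call_model` helper and context keys are made up for illustration:

```python
def single_generalist_agent(goal, full_context, all_tools, call_model):
    # One agent sees everything: the full context and every tool.
    return call_model(goal=goal, context=full_context, tools=all_tools)


def split_into_specialists(goal, full_context, call_model):
    # The pattern the takeaway argues against: each sub-agent sees only a
    # slice of the context, and the final merge is a lossy handoff.
    research = call_model(goal="research sub-task", context=full_context["web"])
    code = call_model(goal="coding sub-task", context=full_context["repo"])
    return call_model(goal=goal, context=[research, code])
```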
Instead of relying on expensive, general-purpose frontier models, companies can achieve better performance at lower cost. By creating a reinforcement learning (RL) environment specific to their application (e.g., a code editor), they can train smaller, specialized open-source models to excel at that task for a fraction of the cost.
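A minimal sketch of what an application-specific RL environment could look like for a code-editing task, in the familiar reset()/step() shape; the repo layout and reward signal (tests pass or not) are assumptions, not any specific company's setup:

```python
import shutil
import subprocess
import tempfile


class CodeEditEnv:
    def __init__(self, repo_template: str, task_prompt: str):
        self.repo_template = repo_template
        self.task_prompt = task_prompt

    def reset(self) -> str:
        """Copy a fresh working repo and return the initial observation."""
        self.workdir = tempfile.mkdtemp()
        shutil.copytree(self.repo_template, self.workdir, dirs_exist_ok=True)
        return self.task_prompt

    def step(self, action: dict):
        """Action: an edit proposed by the policy. Reward: do the tests pass?"""
        with open(f"{self.workdir}/{action['path']}", "w") as f:
            f.write(action["content"])
        tests = subprocess.run(
            ["pytest", "-q"], cwd=self.workdir, capture_output=True, text=True
        )
        reward = 1.0 if tests.returncode == 0 else 0.0
        done = reward == 1.0
        observation = tests.stdout[-2000:]  # truncated test output for the policy
        return observation, reward, done
```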
The perceived limits of today's AI are not inherent to the models themselves but to our failure to build the right "agentic scaffold" around them. There's a "model capability overhang" where much more potential can be unlocked with better prompting, context engineering, and tool integrations.
When Artificial Analysis tested models on the GDPVal benchmark, its simple agent harness allowed models like Claude to outperform their official web chatbot counterparts. This implies that bespoke chatbot environments are often constrained for cost or safety reasons, limiting the model's full agentic capabilities, which developers can unlock with custom tooling.
Criticism of AI frameworks is nuanced. High-level abstractions like `import agent` can hide complexity and make systems hard to adapt. However, low-level orchestration frameworks that provide building blocks like nodes and edges add real utility (e.g., checkpointing) without sacrificing transparency.
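To make the distinction concrete, here is a generic sketch of the low-level style: explicit nodes and edges with a checkpoint written after every step. It illustrates the idea and is not any particular framework's API:

```python
import json


class Graph:
    def __init__(self):
        self.nodes, self.edges = {}, {}

    def add_node(self, name, fn):
        self.nodes[name] = fn  # fn(state) -> new state

    def add_edge(self, src, router):
        """router(state) returns the next node name, or None to stop."""
        self.edges[src] = router

    def run(self, start, state, checkpoint_path="checkpoint.json"):
        node = start
        while node is not None:
            state = self.nodes[node](state)
            # Checkpoint after every node so a crashed run can be resumed.
            with open(checkpoint_path, "w") as f:
                json.dump({"last_node": node, "state": state}, f)
            node = self.edges.get(node, lambda s: None)(state)
        return state
```

Because the control flow is spelled out as nodes and edges, the system stays inspectable and resumable, which is exactly the transparency that an opaque `import agent` abstraction tends to hide.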