Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

Bending Spoons avoids massive token costs by using self-hosted, narrow-purpose AI models for specific tasks. An internal AI orchestrator routes jobs to the most cost-effective model, reserving expensive frontier models only for complex tasks or supervision. This strategy allows them to generate over 90% of their code with AI at a modest cost.

Related Insights

Enterprises are currently overspending on tokens by sending all queries to the most powerful LLMs. A new software category will emerge to intelligently route requests to smaller, cheaper models when possible, creating a critical efficiency and cost-saving layer between companies and foundational model providers.

An effective cost-saving strategy for agentic workflows is to use a powerful model like Claude Opus to perform a complex task once and generate a detailed 'skill.' This skill can then be reliably executed by a much cheaper and faster model like Sonnet for subsequent use.

Relying solely on premium models like Claude Opus can lead to unsustainable API costs ($1M/year projected). The solution is a hybrid approach: use powerful cloud models for complex tasks and cheaper, locally-hosted open-source models for routine operations.

In response to budget blowouts from agentic AI, enterprises are moving beyond simple adoption to active cost management. A new "token efficiency" stack is emerging, featuring tactics like model routing to cheaper alternatives (e.g., DeepSeek) and custom post-trained models to reduce reliance on expensive foundation models.

Companies are building intelligent systems that analyze a user's prompt and automatically route it to the most cost-effective model that can handle the task. This avoids using expensive frontier models for simple requests, with some companies like Coinbase successfully keeping costs flat despite exponential usage growth.

By training a smaller, specialized model where company data is in the weights, firms avoid the high token costs of repeatedly feeding context to large frontier models. This makes complex, data-intensive workflows significantly cheaper and faster.

To optimize costs, users configure powerful models like Claude Opus as the 'brain' to strategize and delegate execution tasks (e.g. coding) to cheaper, specialized models like ChatGPT's Codec, treating them as muscles.

To optimize AI costs in development, use powerful, expensive models for creative and strategic tasks like architecture and research. Once a solid plan is established, delegate the step-by-step code execution to less powerful, more affordable models that excel at following instructions.

To prevent AI agent usage costs from spiraling, GitHub expects the solution will be intelligent model routing. These systems will automatically select the most efficient and cost-effective AI model for a given task, such as using a cheap model for simple refactoring instead of a powerful, expensive one.

To control inference costs, companies are implementing model routing systems. They differentiate between expensive tokens from frontier models for complex reasoning and cheaper tokens from fine-tuned open-source models for simpler workflow tasks. This tiered approach optimizes both performance and budget, avoiding "token maxing."

Bending Spoons Achieves 90% AI-Written Code With Cheap, Narrow Models | RiffOn