An LLM shouldn't do math internally any more than a human would. The most intelligent AI systems will be those that know when to call specialized, reliable tools—like a Python interpreter or a search API—instead of attempting to internalize every capability from first principles.
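As a sketch of that dispatch pattern: the application, not the model, owns the arithmetic. The tool-call shape below is illustrative rather than any specific vendor's API, and a minimal exact evaluator stands in for a full Python interpreter.

```python
import ast
import operator

# Exact arithmetic the model can delegate to, instead of guessing digits
# token by token. Only pure arithmetic expressions are accepted.
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}

def eval_arithmetic(expr: str):
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not pure arithmetic")
    return walk(ast.parse(expr, mode="eval"))

def dispatch_tool_call(call: dict):
    # `call` stands in for a parsed tool-call message from the model.
    if call["name"] == "calculator":
        return eval_arithmetic(call["arguments"]["expression"])
    raise KeyError(f"unknown tool: {call['name']}")

print(dispatch_tool_call(
    {"name": "calculator", "arguments": {"expression": "1024 * 3 - 7"}}
))  # 3065, computed exactly rather than guessed
```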
LLMs shine when acting as a "knowledge extruder": shaping well-documented, "in-distribution" concepts into specific code. They fail when the core task is novel problem-solving where deep thinking, not code generation, is the bottleneck. In those cases, the code is the easy part.
The top 1% of AI companies making significant revenue don't rely on popular frameworks like LangChain. They gain more control and performance by making small, direct LLM calls for specific parts of their application, avoiding the black-box abstractions of frameworks that remain more common among the other 99% of builders.
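What a "small, direct call" looks like in practice is roughly this: one HTTP request, one response, nothing hidden. The endpoint and payload follow OpenAI's chat-completions REST API; the model name is an assumption, so substitute whatever you actually use.

```python
import os
import requests

def summarize(text: str) -> str:
    # One direct call to the chat-completions endpoint; every token of the
    # prompt and every parameter is visible right here, with no framework
    # layers in between.
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-mini",  # assumed model name; use your own
            "messages": [
                {"role": "system", "content": "Summarize in one sentence."},
                {"role": "user", "content": text},
            ],
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Swapping providers or adding retries is then a few visible lines of code, which is exactly the control that framework abstractions take away.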
AI's strength lies in solving "differentiable" problems where being "close enough" is acceptable, like generating an image. Classical code is better for non-differentiable tasks requiring exact precision, like arithmetic or hashing. This framework helps architects decide where to deploy AI versus traditional algorithms.
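Hashing makes the distinction concrete: a digest that is one bit off is not "close", it is simply wrong. A deterministic library call gives the exact answer every time, which no approximately-correct generator can.

```python
import hashlib

# Non-differentiable means no partial credit: the only acceptable output is
# the exact digest, byte for byte.
digest = hashlib.sha256(b"hello").hexdigest()
assert digest == "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
```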
High productivity isn't about using AI for everything. It's a disciplined workflow: breaking a task into sub-problems, using an LLM for high-leverage parts like scaffolding and tests, and reserving human focus for the core implementation. This avoids sinking effort into forcing AI onto unsuitable tasks.
Current AI models resemble a student who grinds 10,000 hours on a narrow task. They achieve superhuman performance on benchmarks but lack the broad, adaptable intelligence of someone with less specific training but better general reasoning. This explains the gap between eval scores and real-world utility.
The "bitter lesson" in AI research posits that methods leveraging massive computation scale better and ultimately win out over approaches that rely on human-designed domain knowledge or clever shortcuts, favoring scale over ingenuity.
A "GenAI solves everything" mindset is flawed. High-latency models are unsuitable for real-time operational needs, like optimizing a warehouse worker's scanning path, which requires millisecond responses. The key is to apply the right tool, whether an optimizer, machine learning, or GenAI, to the specific business problem.
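For scale, here is a sketch of the kind of tool that actually fits that job: a plain nearest-neighbor heuristic orders a pick path in microseconds, with no model call and no network latency. The coordinates and the greedy heuristic are illustrative assumptions, not a production routing algorithm.

```python
import math

def pick_order(start, locations):
    """Greedy nearest-neighbor ordering of scan locations from `start`."""
    remaining = list(locations)
    path, here = [], start
    while remaining:
        # Always walk to the closest unvisited location next.
        nxt = min(remaining, key=lambda p: math.dist(here, p))
        remaining.remove(nxt)
        path.append(nxt)
        here = nxt
    return path

print(pick_order((0, 0), [(5, 1), (1, 1), (2, 4), (6, 5)]))
# [(1, 1), (2, 4), (6, 5), (5, 1)]: good enough, and effectively instant
```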
The perceived limits of today's AI are not inherent to the models themselves but to our failure to build the right "agentic scaffold" around them. There's a "model capability overhang" where much more potential can be unlocked with better prompting, context engineering, and tool integrations.
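A minimal sketch of what such a scaffold looks like, assuming a generic message format: a loop where the model proposes an action, the scaffold executes it, and the observation is fed back into context. The `call_model` stub and tool table here are placeholders; a real system wires in an actual LLM client and sandboxed tools.

```python
def call_model(messages: list[dict]) -> dict:
    # Stand-in for a real LLM call. A real model decides between acting and
    # answering; this stub acts once, then answers, to exercise the loop.
    if len(messages) == 1:
        return {"type": "action", "tool": "shell", "arg": "ls"}
    return {"type": "answer", "content": f"observed: {messages[-1]['content']}"}

TOOLS = {
    "shell": lambda arg: f"(ran: {arg})",  # replace with real, sandboxed tools
}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        move = call_model(messages)
        if move["type"] == "answer":  # the model decided it is done
            return move["content"]
        observation = TOOLS[move["tool"]](move["arg"])  # execute the action
        messages.append({"role": "tool", "content": observation})
    return "step budget exhausted"

print(run_agent("list the files in the current directory"))
```

Much of the "overhang" lives in this loop: better prompts, richer context, and more reliable tools all slot in without touching the model itself.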
To effectively interact with the world and use a computer, an AI is most powerful when it can write code. OpenAI's thesis is that even agents for non-technical users will be "coding agents" under the hood, as code is the most robust and versatile way for AI to perform tasks.
We perceive complex math as a pinnacle of intelligence, but for AI, it may be an easier problem than tasks we find trivial. Like chess, which computers mastered decades ago, solving major math problems might not signify human-level reasoning but rather that the domain is surprisingly susceptible to computational approaches.