Although ChatGPT is a language model, its most valuable application in a data journalism experiment was not reporting or summarizing but generating and debugging Python code for a map. This technical capability proved more efficient and reliable than its core content-related functions.
A practical hack to improve AI agent reliability is to avoid built-in tool-calling functions. LLMs have more training data on writing code than on specific tool-use APIs. Prompting the agent to write and execute the code that calls a tool leverages its core strength and produces better outcomes.
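A minimal sketch of this pattern, with all names hypothetical: instead of registering a tool schema with the model's tool-calling API, the prompt describes an ordinary Python function and asks the model to reply with code that calls it; the agent then executes that code in a controlled namespace. The `fake_llm` stub stands in for a real model call.

```python
def get_weather(city: str) -> str:
    # Stand-in for a real tool/API the agent can use.
    return f"Sunny in {city}"

# The prompt describes the tool as plain Python rather than a tool-call schema.
PROMPT = (
    "You have a Python function get_weather(city: str) -> str. "
    "Write one line of Python that stores the weather for Paris "
    "in a variable named `result`. Reply with code only."
)

def fake_llm(prompt: str) -> str:
    # Hypothetical stub: a real agent would send `prompt` to a model here.
    # We hard-code a plausible code-only reply for illustration.
    return 'result = get_weather("Paris")'

code = fake_llm(PROMPT)
namespace = {"get_weather": get_weather}
exec(code, namespace)        # run the model-written code
print(namespace["result"])   # the tool's return value, via generated code
```

The model is doing what it has the most training data for (writing ordinary Python) rather than emitting a provider-specific tool-call payload; the agent's job reduces to executing the returned snippet safely.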
LLMs shine when acting as a 'knowledge extruder'—shaping well-documented, 'in-distribution' concepts into specific code. They fail when the core task is novel problem-solving where deep thinking, not code generation, is the bottleneck. In these cases, the code is the easy part.
Browser-based ChatGPT cannot execute code or connect to external APIs, limiting its power. The Codex CLI unlocks these agentic capabilities, allowing it to interact with local files, run scripts, and connect to databases, making it a far more powerful tool for real-world tasks.
Karpathy found AI coding agents struggle with genuinely novel projects like his NanoChat repository. Their training on common internet patterns causes them to misunderstand custom implementations and try to force standard, but incorrect, solutions. They are good for autocomplete and boilerplate but not for intellectually intense, frontier work.
Despite the hype around AI's coding prowess, an OpenAI study reveals it is a niche activity on consumer plans, accounting for only 4% of messages. The vast majority of usage is for more practical, everyday guidance like writing help, information seeking, and general advice.
Coding is a unique domain that severely tests LLM capabilities. Unlike other use cases, it involves extremely long-running sessions (up to 30 days for a single task), massive context accumulation from files and command outputs, and requires high precision, making it a key driver for core model research.
The primary constraint on output is no longer a tool's capability but the user's skill in prompting it. This is exemplified by a developer who built a complex real-time strategy (RTS) game from scratch in one week purely by prompting an AI model, having written no code by hand in two months.
Craig Hewitt argues ChatGPT is a consumer product. For serious business tasks, agentic AI tools like Manus (built on Claude) are superior, offering web browsing, data aggregation, and code generation that go far beyond a simple chat interface.
To effectively interact with the world and use a computer, an AI is most powerful when it can write code. OpenAI's thesis is that even agents for non-technical users will be "coding agents" under the hood, as code is the most robust and versatile way for AI to perform tasks.
According to GitHub's COO, the initial concept for Copilot was a tool to help developers with the tedious task of writing documentation. The team pivoted when they realized the same underlying transformer model was far more powerful for generating the code itself.