Sam Altman acknowledged that models are becoming "spiky," with capabilities improving unevenly. OpenAI intentionally prioritized making GPT-5.2 excel at reasoning and coding, at the cost of its creative writing and prose quality. This highlights the trade-offs inherent in current model training.
Salesforce's AI Chief warns of "jagged intelligence," where LLMs can perform brilliant, complex tasks but fail at simple common-sense ones. This inconsistency is a significant business risk, as a failure in a basic but crucial task (e.g., loan calculation) can have severe consequences.
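The loan example is worth making concrete. Below is a minimal sketch, not from the source, of the kind of deterministic guardrail a business might wrap around an LLM-produced figure; the amortization formula is standard, but the function name, the input numbers, and the "model output" are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

def monthly_payment(principal: Decimal, annual_rate: Decimal, months: int) -> Decimal:
    """Standard amortized-loan payment: P * r / (1 - (1 + r)**-n), rounded to cents."""
    r = annual_rate / Decimal(12)  # monthly interest rate
    if r == 0:
        raw = principal / months
    else:
        raw = principal * r / (1 - (1 + r) ** -months)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Hypothetical guardrail: reject an LLM-generated figure that disagrees with the
# deterministic calculation instead of passing it through to the customer.
llm_quote = Decimal("632.07")  # illustrative number "returned by the model"
computed = monthly_payment(Decimal("100000"), Decimal("0.045"), 180)
if abs(llm_quote - computed) > Decimal("0.01"):
    print(f"Reject: model quoted {llm_quote}, amortization formula gives {computed}")
```

The point is not that the check is clever, but that the simple, verifiable arithmetic is exactly where an unguarded "jagged" model can quietly cause the most damage.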
Reports that OpenAI hasn't completed a new full-scale pre-training run since May 2024 suggest a strategic shift. The race for raw model scale may be less critical than enhancing existing models with better reasoning and product features that customers demand. The business goal is profit, not necessarily achieving the next level of model intelligence.
AI's capabilities are highly uneven. Models are already superhuman in specific domains like speaking 150 languages or possessing encyclopedic knowledge. However, they still fail at things most humans find easy, such as learning continually or handling nuanced visual reasoning like judging perspective in a photo.
Current AI models resemble a student who grinds 10,000 hours on a narrow task. They achieve superhuman performance on benchmarks but lack the broad, adaptable intelligence of someone with less specific training but better general reasoning. This explains the gap between eval scores and real-world utility.
Newer LLMs exhibit a more homogenized writing style than earlier versions like GPT-3. This is due to "style burn-in," where training on outputs from previous generations reinforces a specific, often less creative, tone. The model’s style becomes path-dependent, losing the raw variety of its original training data.
Sam Altman confesses he is surprised by how little the core ChatGPT interface has changed. He initially believed the simple chat format was a temporary research preview and would need significant evolution to become a widely used product, but its generality proved far more powerful than he anticipated.
The perceived plateau in AI model performance is specific to consumer applications, where GPT-4 level reasoning is sufficient. The real future gains are in enterprise and code generation, which still have a massive runway for improvement. Consumer AI needs better integration, not just stronger models.
AI models excel at specific tasks (like evals) because they are trained exhaustively on narrow datasets, akin to a student practicing 10,000 hours for a coding competition. While they become experts in that domain, they fail to develop the broader judgment and generalization skills needed for real-world success.
OpenAI's CEO believes a significant gap exists between what current AI models can do and how people actually use them. He calls this "overhang," suggesting most users still query powerful models with simple tasks, leaving immense economic value untapped because human workflows adapt slowly.
Current AI models exhibit "jagged intelligence," performing at a PhD level on some tasks but failing at simple ones. Google DeepMind's CEO identifies this inconsistency and lack of reliability as a primary barrier to achieving true, general-purpose AGI.