The plan to use AI to solve its own safety risks has a critical failure mode: an unlucky ordering of capabilities. If AI becomes a savant at accelerating its own R&D long before it becomes useful for complex tasks like alignment research or policy design, we could be locked into a rapid, uncontrollable takeoff.
The 'use AI for safety' plan adopted by frontier labs is most likely to fail not because alignment techniques are ineffective, but because competitive pressures will prevent them from redirecting a meaningful fraction of their AI labor away from capabilities research and towards safety work when it matters most.
If society gets an early warning of an intelligence explosion, the primary strategy should be to redirect the nascent superintelligent AI 'labor' away from accelerating AI capabilities. Instead, this powerful new resource should be immediately tasked with solving the safety, alignment, and defense problems that advanced AI itself creates, such as patching software vulnerabilities or designing biodefenses.
Despite progress in making models seem helpful, a sudden, catastrophic break in alignment (a 'sharp left turn') remains a coherent possibility. Such a break would occur when capabilities outstrip supervision, a threshold we have not yet crossed, so current cooperative behavior is not strong evidence against this future risk.
AI offers incredible short-term benefits, from fixing daily problems to curing diseases. This immediate positive reinforcement makes it extremely difficult for society to acknowledge and address the simultaneous development of long-term, catastrophic risks, creating a classic devil's bargain.
Ryan Kidd argues that AI safety and capabilities work are nearly impossible to separate: safety improvements like RLHF make models more useful and steerable, which in turn accelerates demand for more powerful 'engines.' Pure 'safety-only' research is therefore a practical impossibility.
A key failure mode for using AI to solve AI safety is an 'unlucky' development path where models become superhuman at accelerating AI R&D before becoming proficient at safety research or other defensive tasks. This could create a period where we know an intelligence explosion is imminent but are powerless to use the precursor AIs to prepare for it.
The competitive landscape of AI development forces a race to the bottom. Even companies that want to prioritize safety must release powerful models quickly or risk losing funding, market share, and a seat at the policy table. This dynamic ensures the fastest, most reckless approach wins.
For any given failure mode, there is a point where further technical research stops being the primary solution. Risks become dominated by institutional or human factors, such as a company's deliberate choice not to prioritize safety. At this stage, policy and governance become more critical than algorithms.
The current approach to AI safety involves identifying and patching specific failure modes (e.g., hallucinations, deception) as they emerge. This 'leak by leak' approach leaves the underlying system dynamics untouched, so overall pressure and risk keep building and failures become increasingly severe and sophisticated.
The most likely reason AI companies will fail to implement their 'use AI for safety' plans is not that the technical problems are unsolvable. Rather, it's that intense competitive pressure will disincentivize them from redirecting significant compute resources away from capability acceleration toward safety, especially without robust, pre-agreed commitments.