The 'use AI for safety' plan adopted by frontier labs is most likely to fail not because alignment techniques are ineffective, but because competitive pressures will prevent them from redirecting a meaningful fraction of their AI labor away from capabilities research and towards safety work when it matters most.
The plan to use AI to solve its own safety risks has a critical failure mode: an unlucky ordering of capabilities. If AI becomes a savant at accelerating its own R&D long before it becomes useful for complex tasks like alignment research or policy design, we could be locked into a rapid, uncontrollable takeoff.
The primary danger in AI safety is not a lack of theoretical solutions but the tendency for developers to implement defenses on a "just-in-time" basis. This leads to cutting corners and implementation errors, analogous to how strong cryptography is often defeated by sloppy code, not broken algorithms.
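The cryptography analogy is concrete: an algorithm can be mathematically sound while a one-line implementation slip defeats it. As a minimal sketch (the key and function names here are hypothetical, not from the episode), compare a MAC check written with a plain equality test, which leaks timing information byte by byte, against Python's constant-time hmac.compare_digest:

```python
import hashlib
import hmac

SECRET_KEY = b"example-key"  # hypothetical key, for illustration only


def sign(message: bytes) -> bytes:
    """HMAC-SHA256 tag; the algorithm itself is considered strong."""
    return hmac.new(SECRET_KEY, message, hashlib.sha256).digest()


def verify_sloppy(message: bytes, tag: bytes) -> bool:
    # BUG: `==` on bytes returns as soon as a byte differs, so response
    # timing reveals how much of a forged tag is correct so far.
    return sign(message) == tag


def verify_correct(message: bytes, tag: bytes) -> bool:
    # Constant-time comparison closes the timing side channel.
    return hmac.compare_digest(sign(message), tag)
```

Nothing about HMAC-SHA256 is broken in the sloppy version; the failure is entirely in the "just-in-time" implementation detail, which is the point of the analogy.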
If society gets an early warning of an intelligence explosion, the primary strategy should be to redirect the nascent superintelligent AI labor away from accelerating capabilities research. Instead, this powerful new resource should be immediately tasked with solving the safety, alignment, and defense problems its own arrival creates, such as patching software vulnerabilities or designing biodefenses.
In the high-stakes race for AGI, nations and companies view safety protocols as a hindrance: slowing down for safety could mean losing the race to a competitor such as China. In this competitive landscape, caution gets reframed as a luxury rather than a necessity.
AI leaders aren't ignoring risks because they're malicious, but because they are trapped in a high-stakes competitive race. This "code red" environment incentivizes patching safety issues case-by-case rather than fundamentally re-architecting AI systems to be safe by construction.
Ryan Kidd argues that AI safety work and capabilities work are nearly impossible to separate. Safety improvements like RLHF make models more useful and steerable, which in turn accelerates demand for more powerful "engines." On this view, pure "safety-only" research is a practical impossibility.
A key failure mode for using AI to solve AI safety is an 'unlucky' development path where models become superhuman at accelerating AI R&D before becoming proficient at safety research or other defensive tasks. This could create a period where we know an intelligence explosion is imminent but are powerless to use the precursor AIs to prepare for it.
The competitive landscape of AI development forces a race to the bottom. Even companies that want to prioritize safety must release powerful models quickly or risk losing funding, market share, and a seat at the policy table. This dynamic ensures the fastest, most reckless approach wins.
For any given failure mode, there is a point where further technical research stops being the primary solution. Risks become dominated by institutional or human factors, such as a company's deliberate choice not to prioritize safety. At this stage, policy and governance become more critical than algorithms.
The most likely reason AI companies will fail to implement their 'use AI for safety' plans is not that the technical problems are unsolvable. Rather, it's that intense competitive pressure will disincentivize them from redirecting significant compute resources away from capability acceleration toward safety, especially without robust, pre-agreed commitments.