Rohin Shah, head of AGI safety at DeepMind, believes existing arguments for catastrophic misalignment are only suggestive, not compelling. While they are sufficient to warrant significant safety work, he sees major holes in arguments that catastrophe is the likely or default outcome of AGI development.
The development of superintelligence is unique because the first major alignment failure will be the last. Unlike other fields of science where failure leads to learning, an unaligned superintelligence would eliminate humanity, precluding any opportunity to try again.
The plan to use AI to solve its own safety risks has a critical failure mode: an unlucky ordering of capabilities. If AI becomes a savant at accelerating its own R&D long before it becomes useful for complex tasks like alignment research or policy design, we could be locked into a rapid, uncontrollable takeoff.
Current AI alignment focuses on how AI should treat humans. A more stable paradigm is "bidirectional alignment," which also asks what moral obligations humans have toward potentially conscious AIs. Neglecting this could create AIs that rationally see humans as a threat due to perceived mistreatment.
Emmett Shear argues that even a successfully 'solved' technical alignment problem creates an existential risk. A super-powerful tool that perfectly obeys human commands is dangerous because humans lack the wisdom to wield that power safely. Our own flawed and unstable intentions become the source of danger.
Despite progress in making models seem helpful, the risk of a sudden, catastrophic break in alignment (a 'sharp left turn') is still a coherent possibility. Such a break would occur when capabilities outstrip supervision, a threshold we haven't yet crossed. Thus, current cooperative behavior is not strong evidence against this future risk.
A key failure mode for using AI to solve AI safety is an 'unlucky' development path where models become superhuman at accelerating AI R&D before becoming proficient at safety research or other defensive tasks. This could create a period where we know an intelligence explosion is imminent but are powerless to use the precursor AIs to prepare for it.
The core safety challenge is that we have little understanding of how advanced AI systems function internally. We are essentially "growing" them through training, not engineering them with comprehensible parts. This means we cannot verify their true goals, making safety measures a gamble on observed behavior.
Countering the idea that complex systems are inherently resilient, Vitalik Buterin expresses a strong belief that humanity may not recover from a misaligned AGI. He contends that the transition to superintelligence is a unique, high-stakes event where we have only one chance to get it right, justifying extreme caution.
A safe AGI deployment requires many independent factors to succeed simultaneously: trustworthy actors, perfect security, solved alignment, etc. In contrast, disaster can occur from a failure in any single one of these areas. This "disjunctive" nature of failure makes a bad outcome highly probable.
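The arithmetic behind this "disjunctive" argument can be sketched in a few lines. The example below is only an illustration: the factor names and the 0.8 probabilities are hypothetical placeholders, and the factors are assumed to be statistically independent, which the original argument does not establish.

```python
from math import prod


def joint_success_probability(factor_probs):
    """Probability that every factor succeeds, ASSUMING the factors are independent."""
    return prod(factor_probs)


# Hypothetical, illustrative numbers only; they are not estimates from the source.
factors = {
    "trustworthy_actors": 0.8,
    "security": 0.8,
    "alignment": 0.8,
}

p_success = joint_success_probability(factors.values())
p_failure = 1 - p_success  # failure of any single factor counts as overall failure

print(f"P(all factors succeed) = {p_success:.3f}")
print(f"P(at least one fails)  = {p_failure:.3f}")
```

Under these assumptions, each added requirement multiplies the success probability by a number below one, so the chance of overall success shrinks as the list of requirements grows. If the factors are correlated, or the individual probabilities are much higher, the conclusion weakens accordingly.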
The AI safety community fears losing control of AI. However, achieving perfect control of a superintelligence is equally dangerous. It grants godlike power to flawed, unwise humans. A perfectly obedient super-tool serving a fallible master is just as catastrophic as a rogue agent.