Rohin Shah, head of AGI safety at DeepMind, believes existing arguments for catastrophic misalignment are only suggestive, not compelling. He considers them sufficient to warrant significant safety work, but sees major holes in arguments that catastrophe is the likely or default outcome of AGI development.
Prosaic AI alignment research is similar enough to capabilities research that it will likely accelerate in tandem during an intelligence explosion. The real danger is that governance, which requires different skills and societal buy-in, won't keep pace, as policymakers may be unwilling to automate their own work with AI.
Despite perceptions of rapid acceleration, a large-scale analysis by Google DeepMind and Epoch AI that stitches together many benchmarks over time shows that general AI capability progress has been remarkably linear. This suggests AI is currently a better tool, not an expanding population of researchers.
External pressure for AI companies to make public commitments is misguided because companies can and will back out of them if they become inconvenient or outdated. Rohin Shah points to Anthropic's Responsible Scaling Policy as an example where strong "commitment" language was later weakened.
Abstract theory from outside an AI lab is unlikely to be adopted due to immense internal implementation constraints. To be useful, external research must provide a concrete solution, a new evaluation, or a clear metric that can be easily integrated into a complex, fragile development pipeline.
A report like Google's Frontier Safety Report serves a specific purpose: to formally declare that the company has determined a model is safe to release. It is not designed to provide the level of detail needed for external actors to replicate or deeply scrutinize the evaluations; that's the role of academic papers.
A key part of Google DeepMind's safety plan is to treat powerful, internally-used AI systems as potential untrusted insiders. This means building infrastructure that gives AIs separate identities, forces them to request permissions individually with justifications, and monitors their actions for suspicious behavior.
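The three pieces named above (separate identities, individually justified permission requests, and monitoring) can be illustrated with a toy sketch. Everything here is hypothetical: `PermissionRequest`, `PermissionBroker`, the `sensitive` set, and the auto-deny rule are assumptions made for illustration and do not describe DeepMind's actual infrastructure.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class PermissionRequest:
    agent_id: str        # each AI agent has its own identity (hypothetical scheme)
    resource: str        # the single resource being requested
    justification: str   # free-text reason, recorded for later review


@dataclass
class PermissionBroker:
    # Resources that are never auto-granted in this toy model (assumption).
    sensitive: Set[str]
    grants: Set[Tuple[str, str]] = field(default_factory=set)
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def request(self, req: PermissionRequest) -> bool:
        """Grant access per agent and per resource, only with a justification."""
        ok = bool(req.justification.strip()) and req.resource not in self.sensitive
        if ok:
            self.grants.add((req.agent_id, req.resource))
        self.audit_log.append((req.agent_id, req.resource, "granted" if ok else "denied"))
        return ok

    def act(self, agent_id: str, resource: str) -> bool:
        """Check an attempted action against that agent's own grants and log it."""
        allowed = (agent_id, resource) in self.grants
        self.audit_log.append((agent_id, resource, "allowed" if allowed else "flagged"))
        return allowed

    def suspicious(self) -> List[Tuple[str, str, str]]:
        """Return log entries a monitor might escalate (denied or ungranted access)."""
        return [entry for entry in self.audit_log if entry[2] in ("denied", "flagged")]
```

In this sketch a grant to one agent never carries over to another, so an agent acting on a resource it did not request shows up in `suspicious()`. A real system would presumably route sensitive requests to human review rather than denying them outright.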
Rohin Shah argues against AI companies making fixed safety commitments. The best practices for safety research change rapidly; a commitment made today (e.g., including alignment data in pre-training) could be considered harmful in the future, making flexibility crucial.
Requiring extensive evaluations right before a model launch creates strong incentives to make them as fast as possible rather than as thorough as possible. Shah argues progress is continuous, so a safety buffer based on the previous model is often sufficient, and the bigger risk is from internal, not external, deployment.
A technique called "myopic optimization" can prevent complex, multi-step reward hacking. By training an AI to optimize each action locally without seeing future rewards, it removes the incentive for schemes that pay off later, even if an overseer couldn't spot the deception.
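One way to see why this removes the incentive is to compare the training target each action receives. The sketch below is a simplified illustration, not the published method: it assumes a per-step reward list and models myopia as a discount factor of zero, and the function names are hypothetical.

```python
from typing import List


def return_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Ordinary (non-myopic) target: each step is credited with all later rewards."""
    targets = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


def myopic_targets(rewards: List[float]) -> List[float]:
    """Myopic target: each step is credited only with its own immediate reward.

    Modeled here as gamma = 0, an assumption made for illustration.
    """
    return return_to_go(rewards, gamma=0.0)


# A three-step episode where steps 0 and 1 only set up a payoff collected at step 2.
episode = [0.0, 0.0, 10.0]
print(return_to_go(episode))    # [10.0, 10.0, 10.0]: the setup steps are reinforced
print(myopic_targets(episode))  # [0.0, 0.0, 10.0]: the setup steps earn nothing
```

Under the non-myopic target, the early setup actions are reinforced by the later payoff whether or not an overseer understands what they were for. Under the myopic target they receive no credit, so a multi-step scheme has nothing to build on.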
Changing one component of a frontier model (like safety) can break dozens of other fragile constraints (e.g., inference speed). Companies can only implement a few changes at a time. Therefore, external actors should model them as resource-constrained and apathetic, not actively malicious, for effective advocacy.
DeepMind's Rohin Shah argues that Transformer models, optimized for parallel processing on GPUs, have low "opaque serial depth." They *must* write down their reasoning steps to their chain-of-thought scratchpad to solve complex serial tasks, making them monitorable. He predicts this will hold for 4-5 years.
Rohin Shah predicts a gradual, not abrupt, start to an intelligence explosion. It will be triggered when automated AI R&D becomes cheaper than human researchers, not when it's vastly more capable. The first automated researchers might be less insightful but use massive, expensive compute to brute-force problems.
The primary need on Google DeepMind's AGI safety team has shifted from generating novel research ideas to implementation. The team is hiring for people with strong software engineering skills who can "do the obvious thing and land it" within the company's complex infrastructure.
