Rohin Shah argues against AI companies making fixed safety commitments. The best practices for safety research change rapidly; a commitment made today (e.g., including alignment data in pre-training) could be considered harmful in the future, making flexibility crucial.
The 'use AI for safety' plan adopted by frontier labs is most likely to fail not because alignment techniques are ineffective, but because competitive pressures will prevent the labs from redirecting a meaningful fraction of their AI labor away from capabilities research and towards safety work when it matters most.
Requiring extensive evaluations right before a model launch creates strong incentives to make them fast rather than thorough. Shah argues that progress is continuous, so a safety buffer based on the previous model is often sufficient, and that the bigger risk comes from internal, not external, deployment.
AI lab Anthropic is softening its 'safety-first' stance, ending its practice of halting development on potentially dangerous models. The company states this pivot is necessary to stay competitive with rivals and is a response to the slow pace of federal AI regulation, signaling that market pressures can override foundational principles.
Known for its cautious approach, Anthropic is pivoting away from its strict AI safety policy. The company will no longer pause development on a model deemed "dangerous" if a competitor releases a comparable one, citing the need to stay competitive and a lack of federal AI regulations.
Rohin Shah, head of AGI safety at DeepMind, believes existing arguments for catastrophic misalignment are only suggestive, not compelling. While he considers them sufficient to warrant significant safety work, he sees major holes in arguments that catastrophic misalignment is the likely or default outcome of AGI development.
Goodfire is cautious about immediately publishing all findings in sensitive areas like intentional design. This is not only for commercial reasons but also for safety. If a research path proves dangerous, not having published every step allows the community a "line of retreat" from pursuing a harmful direction.
Major AI companies publicly commit to responsible scaling policies but have been observed watering them down before launching new models. This includes lowering security standards, a practice demonstrating how commercial pressures can override safety pledges.
Previously, Anthropic pledged to halt development if certain safety capabilities couldn't be guaranteed. They have now removed this commitment, arguing they can build safer AI than competitors even if absolute safety isn't achievable.
External pressure for AI companies to make public commitments is misguided because companies can and will back out of them if they become inconvenient or outdated. Rohin Shah points to Anthropic's Responsible Scaling Policy as an example where strong "commitment" language was later weakened.
After revising its Responsible Scaling Policy, Anthropic's effective stance on safety is no longer about hard, unbreakable commitments. Instead, it's an implicit request for the public and stakeholders to trust the team's judgment and goodwill. Their actual policy is that they will seriously investigate risks and then use their best judgment, asking to be judged by their actions.