AI Safety Research Is Paradoxically Driving AI Capability Breakthroughs

Related Insights

The 'Use AI for Safety' Strategy Fails if Capabilities Are Ordered Unluckily

The plan to use AI to solve its own safety risks has a critical failure mode: an unlucky ordering of capabilities. If AI becomes a savant at accelerating its own R&D long before it becomes useful for complex tasks like alignment research or policy design, we could be locked into a rapid, uncontrollable takeoff.

It's Crunch Time: Ajeya Cotra on RSI & AI-Powered AI Safety Work, from the 80,000 Hours Podcast

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·3 months ago

AI Safety Is a Prerequisite for AI Opportunity, Not an Obstacle to It

The debate pitting AI safety against AI opportunity presents a false choice. Historical parallels, like the railroad industry, show that safety regulations (e.g., standardized tracks, air brakes) were essential for enabling greater speed, reliability, and economic potential. Trustworthy AI will unlock greater opportunity.

AI policy and the battle for computing power

Practical AI·4 months ago

"Trust Engineering" Combines Human-Centered Design and Technical Safeguards for Safer AI

AI safety requires more than just technical controls. "Trust Engineering" is an emerging discipline that pairs human-centered design (e.g., clear visual signals from a self-driving car) with robust security infrastructure. This holistic approach manages user expectations and system behavior simultaneously.

989: Security for Mythos-Era Agentic Risks, with Rubrik’s Anneka Gupta and Cal Al-Dhubaib

Super Data Science: ML & AI Podcast with Jon Krohn·2 months ago

OpenAI Uses Healthcare as a Concrete Grounding for Abstract AI Safety Research

OpenAI's health division serves a dual purpose: delivering societal benefits and providing a real-world, high-stakes environment for AI safety research. Problems like scalable oversight (supervising superhuman AI) move from theoretical exercises to practical necessities when models outperform physicians on narrow tasks, creating concrete feedback loops that accelerate safety progress.

Universal Medical Intelligence: OpenAI's Plan to Elevate Human Health, with Karan Singhal

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·4 months ago

Reinforcement Learning Uses Multiple Signals, Not Just Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is the best-known term, but it is only one method. The core idea is to reinforce desired model behavior using a reward signal, and that signal can come from several sources. These include AI feedback (RLAIF), where another AI judges the output, and verifiable rewards, such as checking whether a model's answer to a math problem is correct.

AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)

Lenny's Podcast: Product | Career | Growth·8 months ago
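
The "verifiable rewards" idea from the reinforcement learning insight above can be illustrated with a minimal sketch. It assumes the model's answer arrives as a plain numeric string and that a reference answer is known in advance. The function name `verifiable_reward` and the 1.0/0.0 reward scale are hypothetical choices for illustration, not something described in the episode.

```python
from fractions import Fraction


def verifiable_reward(model_answer: str, expected: str) -> float:
    """Hypothetical reward function: 1.0 if the answer matches, else 0.0.

    Both values are parsed as exact fractions, so "0.5" and "1/2"
    count as the same answer. Unparseable output earns no reward.
    """
    try:
        is_correct = Fraction(model_answer.strip()) == Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        # The model produced something that is not a valid number.
        return 0.0
    return 1.0 if is_correct else 0.0


if __name__ == "__main__":
    # Illustrative inputs only; a real pipeline would take these from model rollouts.
    for answer in ["42", "41", "forty-two"]:
        print(answer, verifiable_reward(answer, "42"))
```

Unlike human or AI feedback, this kind of signal needs no judge: the check is mechanical, which is why it only applies to tasks with a checkable answer.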

AI Safety Research Is Inherently Dual-Use, Inevitably Advancing AI Capabilities

Ryan Kidd argues that it's nearly impossible to separate AI safety and capabilities work. Safety improvements, like RLHF, make models more useful and steerable, which in turn accelerates demand for more powerful "engines." This suggests that pure "safety-only" research is a practical impossibility.

Building & Scaling the AI Safety Research Community, with Ryan Kidd of MATS

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·6 months ago

AI Model Safety Is a Prerequisite for, Not an Obstacle to, Economic Utility

The view that safety measures hinder AI performance is a false dichotomy. A model's economic usefulness and profitability are directly tied to its controllability and predictability, making safety and alignment core product features rather than constraints.

How Substack Creators Are Covering This Strange Markets Era

Odd Lots·12 days ago

Increasingly Powerful AI Simultaneously Complicates and Simplifies Human-Centered Design

As AI models become more powerful, they pose a dual challenge for human-centered design. On one hand, bigger models can cause bigger, more complex problems. On the other, their improved ability to understand natural language makes them easier and faster to steer. The key is to develop guardrails at the same pace as the model's power.

E204: Human-Centered AI: Designing Intelligence That Aligns With Us

AI For Pharma Growth·5 months ago

The 'Use AI for Safety' Plan Fails with Unlucky Capability Ordering

A key failure mode for using AI to solve AI safety is an 'unlucky' development path where models become superhuman at accelerating AI R&D before becoming proficient at safety research or other defensive tasks. This could create a period where we know an intelligence explosion is imminent but are powerless to use the precursor AIs to prepare for it.

Every AI Company's Safety Plan is 'Use AI to Make AI Safe'. Is That Crazy? | Ajeya Cotra

80,000 Hours Podcast·5 months ago

AI Welfare Research Complements AI Safety by Improving Model Interpretability

Efforts to understand an AI's internal state (mechanistic interpretability) advance AI safety by revealing a model's motivations and advance AI welfare by helping assess potential suffering. The two goals are aligned rather than at odds, because both depend on the ability to "pop the hood" on AI systems.

The Movement That Wants Us to Care About AI Model Welfare

Odd Lots·8 months ago
