"Gradient Routing" Can Create Safer AIs by Isolating and Removing Dangerous Capabilities

Related Insights

Cinder CEO Uses "Model Obliteration" to Train More Effective Moderation AI

Cinder, a platform for stopping AI-powered abuse, uses a technique called "model obliteration": intentionally removing the built-in safety guardrails from open-source models. With the guardrails gone, the company can train the models on harmful content and build more effective, specialized classifiers that detect abuse at scale.

Trial Update, AI SPVs, BuzzFeed Sold | Doomberg, Sahir Jaggi, Sam Blond, Kevin Hartz, Alex Shan, Glen Wise, Roger Lynch

TBPN·a month ago

AI Safety Requires Limiting Model Capabilities, Not Just Teaching Them to Refuse Requests

Simple refusal mechanisms in AI models are easily bypassed by motivated actors. Effective biosecurity requires deeper interventions, such as curating training data to exclude sensitive biological information or implementing strict access controls for the most powerful models, ensuring they aren't publicly available.

Apocalypse soon? AI could hasten bioweapons

Economist Podcasts·a month ago

Eric Drexler’s Vision Offers AI Safety Through an Ecology of Narrow, Superhuman Agents

Instead of building a single, monolithic AGI, the "Comprehensive AI Services" model suggests safety comes from creating a buffered ecosystem of specialized AIs. These agents can be superhuman within their domain (e.g., protein folding) but are fundamentally limited, preventing runaway, uncontrollable intelligence.

My Positive Vision for the AI Future, from the Existential Hope Podcast

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·7 months ago

Anthropic Safeguards Fable 5 by Rerouting Sensitive Queries to Weaker Models

Instead of simply blocking dangerous prompts, Anthropic's Claude Fable 5 directs cybersecurity or AI development queries to a less capable model. This maintains functionality while mitigating risks from its most powerful AI.

Mythos-class Model Claude Fable 5 Early Reviews, How Nasdaq Landed SpaceX's Mega IPO

The Information's TITV·13 days ago

Mechanistic Interpretability Instruments Models Internally to Stop Malicious Outputs Before Generation

This advanced safety method moves beyond black-box filtering by analyzing a model's internal activations at runtime. It identifies which sub-components are associated with undesirable outputs, allowing for intervention or modification of the model's behavior *during* the generation process, rather than just after the fact.

Controlling AI Models from the Inside

Practical AI·5 months ago

Machine Unlearning Actively Suppresses Dangerous Knowledge in AI Models

A novel safety technique, 'machine unlearning,' goes beyond simple refusal prompts by training a model to actively 'forget' or suppress knowledge on illicit topics. When encountering these topics, the model's internal representations are fuzzed, effectively making it 'stupid' on command for specific domains.

Inside The Second International AI Safety Report with Writers Stephen Clare and Stephen Casper

The AI Policy Podcast·4 months ago

Removing Just Human-Infecting Virus Data Cripples AI's Harmful Potential

Research on bio-foundation models like Evo 2 and ESM3 shows that strategically excluding key datasets (e.g., sequences of viruses that infect humans) dramatically reduces a model's performance on dangerous tasks, often to random chance, without harming its useful scientific capabilities.

Bioinfohazards: Jassi Pannu on Controlling Dangerous Data from which AI Models Learn

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·3 months ago

Effective AI 'Defense in Depth' Requires Uncorrelated, Not Just Layered, Safeguards

Most AI "defense in depth" systems fail because their layers are correlated, often using the same base model. A successful approach requires creating genuinely independent defensive components. Even if each layer is individually weak, their independence makes it combinatorially harder for an attacker to bypass them all.

Full-Stack AI Safety: Why Defense-in-Depth Might Work, with Far.AI CEO Adam Gleave

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·9 months ago

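The combinatorial claim above can be made concrete with a small calculation. This is a minimal sketch under simplifying assumptions that are not from the episode: each layer has a fixed probability of being bypassed, "independent" means those events are statistically independent, and "fully correlated" means any attack that beats one layer beats them all. The function names and the 0.3 figure are illustrative only.

```python
from math import prod

def bypass_independent(layer_bypass_probs):
    """Probability an attack gets past every layer when layers fail independently."""
    return prod(layer_bypass_probs)

def bypass_fully_correlated(layer_bypass_probs):
    """Probability when layers share failure modes completely.

    P(all layers bypassed) can never exceed the smallest single-layer
    probability, and perfect correlation reaches that upper bound.
    """
    return min(layer_bypass_probs)

# Three individually weak layers, each bypassed 30% of the time (illustrative).
layers = [0.3, 0.3, 0.3]
independent = bypass_independent(layers)       # about 0.027
correlated = bypass_fully_correlated(layers)   # 0.3
print(f"independent: {independent:.3f}, fully correlated: {correlated:.3f}")
```

Under these assumptions, stacking correlated layers adds almost nothing, while each independent layer multiplies the attacker's difficulty. Real safeguards fall somewhere between the two extremes.
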
Isolate and Test AI Components to Mitigate 'Black Box' Risks in Complex Systems

Instead of treating a complex AI system like an LLM as a single black box, build it in a componentized way by separating functions like retrieval, analysis, and output. This allows for isolated testing of each part, limiting the surface area for bias and simplifying debugging.

Rerun: AI ethics advice from former White House technologist - Kasia Chmielinski (Co-Founder, The Data Nutrition Project)

The Product Experience·6 months ago
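
One way to picture the componentized approach described above is a pipeline whose retrieval, analysis, and output stages are plain functions that can each be tested alone. This is only a sketch: the stage names `retrieve`, `analyze`, and `render`, and the toy keyword matching, are hypothetical stand-ins rather than anything described in the episode.

```python
def retrieve(query, corpus):
    """Retrieval stage: return the documents that mention the query term."""
    term = query.lower()
    return [doc for doc in corpus if term in doc.lower()]

def analyze(documents):
    """Analysis stage: compute simple, checkable facts about the retrieved set."""
    return {
        "matches": len(documents),
        "total_words": sum(len(doc.split()) for doc in documents),
    }

def render(analysis):
    """Output stage: turn the analysis into user-facing text."""
    return (
        f"Found {analysis['matches']} matching document(s), "
        f"{analysis['total_words']} words in total."
    )

def pipeline(query, corpus):
    """Compose the stages; each one can still be exercised in isolation."""
    return render(analyze(retrieve(query, corpus)))

corpus = ["Bias in hiring data", "Weather report", "Measuring bias"]
print(pipeline("bias", corpus))
```

Because each stage has its own inputs and outputs, a surprising result can be traced to one function instead of the whole system, and a check for bias can be aimed at the stage where it would enter.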

AIs Should Be Trained With an Integrated Policy and Guardrail to Prevent Exploitation

Bengio argues that an agent trained separately from its safety guardrail could learn to 'jailbreak' it. His solution is to train the policy (the agent) and the guardrail (the safety monitor) jointly within the same neural network, which prevents the agent from being optimized to find loopholes in the guardrail.

I Know How to Build Safe Superintelligence | Yoshua Bengio, the most-cited AI researcher

80,000 Hours Podcast·2 months ago
