Simple refusal mechanisms in AI models are easily bypassed by motivated actors. Effective biosecurity requires deeper interventions, such as curating training data to exclude sensitive biological information or implementing strict access controls for the most powerful models, ensuring they aren't publicly available.
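
The access-control half of that idea could look like the sketch below: a hypothetical tiered API in which the most capable model tier is served only to verified, biosafety-approved organizations. The Caller fields, tier names, and policy table are invented for illustration.

```python
# Hypothetical tiered access control for a hosted model API (illustrative only).
from dataclasses import dataclass

@dataclass
class Caller:
    org_id: str
    verified: bool            # identity check passed
    biosafety_approved: bool  # e.g., institutional biosafety review on file

# Model tiers mapped to the checks a caller must pass (assumed policy, not a real API).
TIER_REQUIREMENTS = {
    "general": [],
    "frontier-bio": ["verified", "biosafety_approved"],
}

def authorize(caller: Caller, tier: str) -> bool:
    """Grant access only if the caller satisfies every requirement for the tier."""
    return all(getattr(caller, req) for req in TIER_REQUIREMENTS[tier])

anonymous = Caller(org_id="anon", verified=False, biosafety_approved=False)
lab = Caller(org_id="univ-lab-42", verified=True, biosafety_approved=True)

assert not authorize(anonymous, "frontier-bio")  # blocked: not publicly available
assert authorize(lab, "frontier-bio")            # allowed: credentials on file
```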

Related Insights

The current industry approach to AI safety, which focuses on censoring a model's "latent space," is flawed and ineffective. True safety work should reorient around preventing real-world, "meatspace" harm (e.g., data breaches). Security vulnerabilities should be fixed at the system level, not by trying to "lobotomize" the model itself.

A practical security model for AI agents suggests granting them at most two of the following three capabilities: local file access, internet access, and code execution. Granting all three at once creates significant, hard-to-manage vulnerabilities.
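
A minimal sketch of how an agent framework might enforce that rule at configuration time; the capability names are assumptions, not any particular framework's API.

```python
# Enforce the "at most two of three" rule for agent capabilities (illustrative sketch).
RISKY_CAPABILITIES = {"local_files", "internet_access", "code_execution"}

def validate_capabilities(granted: set[str]) -> None:
    """Reject configurations that combine all three high-risk capabilities."""
    risky = granted & RISKY_CAPABILITIES
    if len(risky) == 3:
        raise ValueError(
            "Agent may hold at most two of: local_files, internet_access, "
            "code_execution. Drop one to reduce the attack surface."
        )

validate_capabilities({"local_files", "code_execution"})  # OK: two of three
try:
    validate_capabilities(RISKY_CAPABILITIES)             # all three -> rejected
except ValueError as err:
    print(err)
```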

A novel safety technique, "machine unlearning," goes beyond simple refusal prompts by training a model to actively "forget" or suppress knowledge of illicit topics. When the model encounters these topics, its internal representations are fuzzed with noise, effectively making it "stupid" on command for specific domains.
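
A toy illustration of the "fuzzing" effect, using a PyTorch forward hook to drown a layer's activations in noise when a hypothetical topic flag is set. Real unlearning methods train this behavior into the weights rather than bolting it on at inference time.

```python
import torch
import torch.nn as nn

# Toy illustration of representation "fuzzing": when a flagged topic is detected,
# a forward hook swamps a hidden layer's activations with noise. Actual unlearning
# training bakes this suppression into the model weights instead.
torch.manual_seed(0)
hidden = nn.Linear(16, 16)          # stand-in for one transformer block's hidden layer
topic_flagged = {"active": False}   # would be set by an upstream topic classifier (assumed)

def fuzz_hook(module, inputs, output):
    if topic_flagged["active"]:
        return output + 10.0 * torch.randn_like(output)  # noise dominates the signal
    return output

hidden.register_forward_hook(fuzz_hook)

x = torch.randn(1, 16)
clean = hidden(x)                   # normal representation
topic_flagged["active"] = True
fuzzed = hidden(x)                  # same input, now noise-dominated
print(torch.allclose(clean, fuzzed))  # False: the domain is "forgotten" on command
```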

Instead of trying to control open-source AI models, which is intractable, the proposed strategy is to control the small, expensive-to-produce functional datasets they train on. This preserves the beneficial open-source ecosystem while preventing the dissemination of dangerous capabilities like viral design.

Instead of relying on flawed AI guardrails, focus on traditional security practices. This includes strict permissioning (ensuring an AI agent can't do more than necessary) and containerizing processes (like running AI-generated code in a sandbox) to limit potential damage from a compromised AI.
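
A minimal containment sketch: run generated Python in a separate interpreter with isolated mode, an empty environment, and a hard timeout. This is illustrative only; a production sandbox adds OS-level isolation (containers, seccomp, no network) that a bare subprocess cannot provide.

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: int = 5) -> subprocess.CompletedProcess:
    """Run AI-generated Python in a separate interpreter with isolated mode (-I),
    an empty environment, and a hard timeout. This limits damage but is NOT a
    full sandbox: real deployments add container or VM isolation on top."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    return subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignores user site/env hooks
        env={},                        # no inherited secrets or credentials
        capture_output=True,
        text=True,
        timeout=timeout_s,             # runaway code is killed, not waited on
    )

result = run_untrusted("print(2 + 2)")
print(result.stdout)  # "4"
```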

Research on bio-foundation models like EVO2 and ESM3 shows that strategically excluding key datasets (e.g., sequences of viruses that infect humans) dramatically reduces a model's performance on dangerous tasks, often to random chance, without harming its useful scientific capabilities.
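
A sketch of what that exclusion step might look like in a pretraining pipeline: filtering a FASTA corpus against a denylist of human-infecting viral taxa. The denylist entries and header conventions are assumptions for illustration, not the actual EVO2 or ESM3 curation code.

```python
# Illustrative pretraining-data filter: drop sequences whose FASTA headers mention
# human-infecting viral taxa. Denylist and header format are assumed for the example.
HUMAN_VIRUS_DENYLIST = {"orthopoxvirus", "influenza a", "sars-cov", "ebolavirus"}

def keep_record(header: str) -> bool:
    h = header.lower()
    return not any(taxon in h for taxon in HUMAN_VIRUS_DENYLIST)

def filter_fasta(lines):
    """Yield only FASTA records whose header passes the denylist check."""
    keep = True
    for line in lines:
        if line.startswith(">"):      # header line starts a new record
            keep = keep_record(line)
        if keep:
            yield line

corpus = [
    ">E. coli K-12 genomic fragment", "ATGGCTAGC",
    ">Influenza A virus segment 4 (HA)", "ATGAAGGCA",
]
print(list(filter_fasta(corpus)))  # only the E. coli record survives
```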

Current AI safety solutions primarily act as external filters, analyzing prompts and responses. This "black box" approach is ineffective against jailbreaks and adversarial attacks that manipulate the model's internal workings to generate malicious output from seemingly benign inputs, much like a building's gate security can't stop a resident from causing harm inside.

To balance AI capability with safety, implement "power caps" that prevent a system from operating beyond its core defined function. This approach intentionally limits performance to mitigate risk, prioritizing predictability and user comfort over absolute peak capability, whose pursuit can carry unintended consequences.
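
One way to read a "power cap" in code: an allowlist that confines an assistant to its defined function and refuses everything else. The task names below are hypothetical.

```python
# Hypothetical "power cap": an assistant confined to document summarization,
# refusing tasks outside that core function even if the model could perform them.
ALLOWED_TASKS = {"summarize_document", "extract_key_points", "answer_about_document"}

def dispatch(task: str, **kwargs) -> dict:
    if task not in ALLOWED_TASKS:
        # The underlying model may be capable, but the cap keeps behavior predictable.
        return {"status": "refused", "reason": f"'{task}' exceeds this agent's defined function"}
    return {"status": "accepted", "task": task, "args": kwargs}

print(dispatch("summarize_document", doc_id="memo-7"))   # within the cap
print(dispatch("send_email", to="everyone@corp.com"))    # refused: beyond the cap
```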

Training large language models to ignore malicious "prompt injections" is an unreliable security strategy. Because model outputs are inherently stochastic, a command ignored 1,000 times might be executed on the 1,001st attempt due to a random "dice roll." That occasional success is all a persistent attacker needs.
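
The arithmetic behind that point: if a single injection attempt slips through with probability p, the chance of at least one success over n retries is 1 - (1 - p)^n, which climbs quickly even for tiny p.

```python
# Chance that a persistent attacker lands at least one successful injection.
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    return 1.0 - (1.0 - p_single) ** attempts

# Even a 0.1% per-attempt slip rate compounds fast across retries.
for n in (100, 1_000, 10_000):
    print(n, round(p_at_least_one_success(0.001, n), 3))
# 100 -> 0.095, 1000 -> 0.632, 10000 -> 1.0 (actually ~0.99995)
```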

A comprehensive AI safety strategy mirrors modern cybersecurity, requiring multiple layers of protection. This includes external guardrails, static checks, and internal model instrumentation, which can be combined with system-level data (e.g., a user's refund history) to create complex, robust security rules.
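
A sketch of how those layers might compose into a single rule, with each layer able to veto independently. The guardrail score, static check, and refund-history threshold are invented signals for illustration.

```python
from dataclasses import dataclass

@dataclass
class RequestContext:
    guardrail_score: float     # external classifier: probability the prompt is malicious
    static_check_passed: bool  # e.g., schema/regex validation of the tool arguments
    refunds_last_30d: int      # system-level data joined from the account database

def allow_refund_request(ctx: RequestContext) -> bool:
    """Each layer can veto independently; all must pass (defense in depth)."""
    if ctx.guardrail_score > 0.8:    # layer 1: external guardrail
        return False
    if not ctx.static_check_passed:  # layer 2: static validation
        return False
    if ctx.refunds_last_30d >= 3:    # layer 3: business-logic rule on system data
        return False
    return True

print(allow_refund_request(RequestContext(0.1, True, 0)))  # True: all layers pass
print(allow_refund_request(RequestContext(0.1, True, 5)))  # False: abuse pattern
```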