A novel AI safety technique called gradient routing trains mixture-of-experts models to isolate dangerous knowledge (e.g., bioweapons, cyber exploits) into specific "expert" modules during pre-training. These dangerous experts can then be completely removed ("ablated") before deployment, creating an inherently safer model.
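The two steps described above (route gradients during training, then ablate) can be sketched in a toy setting where each expert's parameters and gradients are plain lists of floats. The names `mask_gradients`, `ablate`, `general`, and `hazard` are hypothetical, and the masking rule used here (flagged examples update only the designated expert) is a simplified reading of the idea, not the published implementation.

```python
def mask_gradients(expert_grads, example_is_flagged, flagged_expert):
    """Return gradients with routing applied for one training example.

    Assumption: for a flagged (dangerous) example, only the designated
    expert receives an update; all other experts get zero gradient.
    Unflagged examples are left untouched.
    """
    masked = {}
    for name, grad in expert_grads.items():
        if example_is_flagged and name != flagged_expert:
            masked[name] = [0.0 for _ in grad]
        else:
            masked[name] = list(grad)
    return masked


def ablate(experts, flagged_expert):
    """Drop the designated expert before deployment."""
    return {name: params for name, params in experts.items()
            if name != flagged_expert}


grads = {"general": [1.0, 2.0], "hazard": [3.0]}
print(mask_gradients(grads, True, "hazard"))
print(ablate({"general": [0.1], "hazard": [0.2]}, "hazard"))
```

Because flagged examples never update the general experts in this sketch, removing `hazard` removes everything the model learned from them.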
Cinder, a platform for stopping AI-powered abuse, uses a technique called "model obliteration." This involves intentionally removing the built-in safety guardrails from open-source models. By doing so, they can train the AI on harmful content and create more effective, specialized classifiers to detect abuse at scale.
Simple refusal mechanisms in AI models are easily bypassed by motivated actors. Effective biosecurity requires deeper interventions, such as curating training data to exclude sensitive biological information or implementing strict access controls for the most powerful models, ensuring they aren't publicly available.
Instead of building a single, monolithic AGI, the "Comprehensive AI Services" model suggests safety comes from creating a buffered ecosystem of specialized AIs. These agents can be superhuman within their domain (e.g., protein folding) but are fundamentally limited, preventing runaway, uncontrollable intelligence.
Instead of simply blocking dangerous prompts, Anthropic's Claude Fable 5 directs cybersecurity or AI development queries to a less capable model. This maintains functionality while mitigating risks from its most powerful AI.
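The routing idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the keyword lists, the model labels, and the function names are hypothetical, and a production system would presumably use a trained classifier rather than substring matching.

```python
# Hypothetical keyword lists standing in for a real topic classifier.
SENSITIVE_TERMS = {
    "cybersecurity": ("exploit", "malware", "privilege escalation"),
    "ai_development": ("training run", "fine-tune", "model weights"),
}
FRONTIER_MODEL = "frontier-model"      # hypothetical label
FALLBACK_MODEL = "less-capable-model"  # hypothetical label


def detect_sensitive_domain(prompt):
    """Return the first sensitive domain whose terms appear in the prompt."""
    text = prompt.lower()
    for domain, terms in SENSITIVE_TERMS.items():
        if any(term in text for term in terms):
            return domain
    return None


def choose_model(prompt):
    """Route sensitive prompts to the fallback model instead of refusing."""
    if detect_sensitive_domain(prompt) is not None:
        return FALLBACK_MODEL
    return FRONTIER_MODEL


print(choose_model("Write an exploit for this bug"))
print(choose_model("Summarize this novel"))
```

The user still gets an answer in both cases; only the capability behind the answer changes.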
One advanced safety method moves beyond black-box filtering by analyzing a model's internal activations at runtime. It identifies which sub-components are associated with undesirable outputs, so the model's behavior can be corrected *during* generation rather than only after the fact.
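One way such a runtime intervention could look, as a sketch only: the source does not name a specific method, so this assumes a single known "undesirable" direction in activation space and removes the activation's component along it when a threshold is exceeded. The function names, the direction vector, and the threshold are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer_away(activation, direction, threshold):
    """Project out `direction` when the activation leans on it too strongly.

    Returns (new_activation, intervened). The score is the coefficient of
    the activation's projection onto `direction`.
    """
    score = dot(activation, direction) / dot(direction, direction)
    if score <= threshold:
        return list(activation), False
    adjusted = [a - score * d for a, d in zip(activation, direction)]
    return adjusted, True


# Hypothetical 2-dimensional example.
print(steer_away([3.0, 4.0], [1.0, 0.0], threshold=1.0))
print(steer_away([0.5, 4.0], [1.0, 0.0], threshold=1.0))
```

The monitoring step (computing the score) and the intervention step (subtracting the projection) are kept in one function here for brevity; in practice they could be separated so that monitoring runs without modifying anything.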
A novel safety technique, 'machine unlearning,' goes beyond simple refusal prompts by training a model to actively 'forget' or suppress knowledge on illicit topics. When encountering these topics, the model's internal representations are fuzzed, effectively making it 'stupid' on command for specific domains.
Research on bio-foundation models like EVO2 and ESM3 shows that strategically excluding key datasets (e.g., sequences of viruses that infect humans) dramatically reduces a model's performance on dangerous tasks, often to random chance, without harming its useful scientific capabilities.
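The exclusion step can be pictured as a metadata filter applied before training. This is a sketch under assumptions: the record fields `organism_type` and `hosts`, the sample records, and the function names are hypothetical and do not reflect how any particular model's dataset is actually annotated.

```python
def is_excluded(record):
    """Flag sequences from viruses annotated as infecting humans."""
    return (record.get("organism_type") == "virus"
            and "human" in record.get("hosts", ()))


def curate(records):
    """Keep only records that are not excluded."""
    return [r for r in records if not is_excluded(r)]


# Hypothetical sample records with placeholder sequences.
dataset = [
    {"id": "seq1", "organism_type": "virus", "hosts": ["human"]},
    {"id": "seq2", "organism_type": "virus", "hosts": ["bacteria"]},
    {"id": "seq3", "organism_type": "bacterium"},
]
print([r["id"] for r in curate(dataset)])
```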
Most AI "defense in depth" systems fail because their layers are correlated, often using the same base model. A successful approach requires creating genuinely independent defensive components. Even if each layer is individually weak, their independence makes it combinatorially harder for an attacker to bypass them all.
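The arithmetic behind this claim can be sketched with two idealized extremes. The per-layer bypass probabilities below are hypothetical, and real systems fall somewhere between full independence and perfect correlation.

```python
from math import prod


def bypass_probability_independent(layer_bypass_probs):
    """Chance an attack passes every layer when layers fail independently."""
    return prod(layer_bypass_probs)


def bypass_probability_correlated(layer_bypass_probs):
    """Idealized perfectly correlated case: an attack that beats the
    hardest layer beats all of them, so only that layer matters."""
    return min(layer_bypass_probs)


weak_layers = [0.5, 0.5, 0.5, 0.5]  # hypothetical: each layer stops half of attacks
print(bypass_probability_independent(weak_layers))  # 0.0625
print(bypass_probability_correlated(weak_layers))   # 0.5
```

Four independent layers that each fail half the time let through about 6% of attacks, while four perfectly correlated ones behave like a single layer.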
Instead of treating a complex AI system like an LLM as a single black box, build it in a componentized way by separating functions like retrieval, analysis, and output. This allows for isolated testing of each part, limiting the surface area for bias and simplifying debugging.
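A minimal sketch of this componentized layout, assuming three hypothetical stages (`retrieve`, `analyze`, `render`) with deliberately trivial logic. The point is the separation, which lets each stage be tested on its own, not the logic inside any stage.

```python
def retrieve(query, corpus):
    """Retrieval stage: keep documents sharing at least one word with the query."""
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def analyze(documents):
    """Analysis stage: compute simple facts about the retrieved documents."""
    return {"count": len(documents),
            "longest": max(documents, key=len, default="")}


def render(analysis):
    """Output stage: turn the analysis into user-facing text."""
    return f"{analysis['count']} matching document(s); longest: {analysis['longest']!r}"


def pipeline(query, corpus):
    return render(analyze(retrieve(query, corpus)))


corpus = ["Bias in hiring models", "Weather report"]
print(pipeline("bias audit", corpus))
```

If the final output looks wrong, each stage can be called with a fixed input to find where the problem enters, instead of probing the whole system end to end.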
Bengio argues that an agent trained separately from its safety guardrail could learn to 'jailbreak' that guardrail. His solution is to train both the policy (the agent) and the guardrail (the safety monitor) jointly within the same neural network, so that the agent is never optimized to find loopholes in the guardrail.