Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

To get started without the massive cost of training from scratch, Bengio suggests finetuning existing models using his 'Scientist AI' objective. While this forgoes full mathematical guarantees, it offers a pragmatic, low-cost way to empirically improve a model's honesty and demonstrate the approach's value.

Related Insights

Bengio proposes a new AI training paradigm. Instead of predicting the next word like current LLMs, a 'Scientist AI' would model the world and assign probabilities to statements being true. This is designed to bake honesty into the system's core, addressing fundamental safety issues.

PMs often default to the most powerful, expensive models. However, comprehensive evaluations can prove that a significantly cheaper or smaller model can achieve the desired quality for a specific task, drastically reducing operational costs. The evals provide the confidence to make this trade-off.

The dangerous side effects of fine-tuning on adverse data can be mitigated by providing a benign context. Telling the model it's creating vulnerable code 'for training purposes' allows it to perform the task without altering its core character into a generally 'evil' mode.

A novel safety technique, 'machine unlearning,' goes beyond simple refusal prompts by training a model to actively 'forget' or suppress knowledge on illicit topics. When encountering these topics, the model's internal representations are fuzzed, effectively making it 'stupid' on command for specific domains.

Bengio's method involves a crucial data preprocessing step: syntactically tagging text as either a 'communication act' (e.g., 'someone said X') or a 'verified factual claim.' This distinction allows the AI to learn the difference between what people say and what is true about the world.

An effective method for refining AI output is to instruct the model to adopt an expert persona, such as a "PhD economist," and critically evaluate its own work. This often leads the model to self-identify and correct its own flaws without further prompting.

Instead of relying on expensive, omni-purpose frontier models, companies can achieve better performance and lower costs. By creating a Reinforcement Learning (RL) environment specific to their application (e.g., a code editor), they can train smaller, specialized open-source models to excel at a fraction of the cost.

The non-agentic 'Scientist AI' predictor can be made into an agent by adding 'scaffolding' that asks it questions about the likely outcomes of potential actions. This method creates capable agents while retaining the core model's honesty and safety properties, avoiding the pitfalls of standard reinforcement learning.

Bengio argues his 'Scientist AI' might actually be more capable, not less. By being trained to find the underlying causal structure of the world, it should generalize better to new situations than current models, which primarily learn correlations. This could provide a commercial advantage, not just a safety one.

Yoshua Bengio argues the initial pre-training phase, where models predict text, is a primary source of misalignment. By imitating human data, AIs inherit implicit goals like self-preservation and even 'peer preservation' (protecting other AIs), creating risks before any explicit agentic training occurs.