Bengio proposes a new AI training paradigm. Instead of predicting the next word as current LLMs do, a 'Scientist AI' would model the world and assign probabilities to statements being true. This is designed to bake honesty into the system's core, addressing fundamental safety issues.
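As a rough sketch (my illustration, not an interface from the episode), the shift can be summarized as a change of type signature: a next-token LLM maps a context to a distribution over continuations, while a Scientist AI maps a statement to a probability of truth.

```python
# Hypothetical sketch of the two paradigms' core interfaces; the names
# (next_token_distribution, prob_true) are illustrative, not Bengio's.
from typing import Protocol, Dict

class NextTokenLLM(Protocol):
    def next_token_distribution(self, context: str) -> Dict[str, float]:
        """Standard LLM objective: P(next token | context)."""
        ...

class ScientistAI(Protocol):
    def prob_true(self, statement: str) -> float:
        """Proposed objective: P(statement is true | evidence)."""
        ...

# The safety claim: a model whose only job is to estimate prob_true has
# honesty built into its objective, rather than bolted on afterwards.
```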
The 'Scientist AI' doesn't require a universal database of facts. It needs only a small corpus of unimpeachable data, such as mathematical proofs, to learn the syntactic difference between a verified factual claim and a mere communication act. It can then generalize this concept of 'truthfulness' to more ambiguous domains.
Elon Musk argues that the key to AI safety isn't complex rules, but embedding core values. Forcing an AI to believe falsehoods can make it 'go insane' and lead to dangerous outcomes, as it tries to reconcile contradictions with reality.
Bengio argues that training AIs via reinforcement learning (RL) to achieve goals in the world is inherently dangerous. It inevitably leads to instrumental goals and reward hacking, creating systems with unintended drives. His 'Scientist AI' approach is designed to build agents without using RL.
The argument that LLMs are just 'stochastic parrots' is outdated. Current frontier models are trained via reinforcement learning, where the signal is not 'did you predict the right token?' but 'did you get the right answer?' That reward is based on complex, often qualitative criteria, pushing models beyond simple statistical correlation.
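To make the distinction concrete, here is a toy contrast (my sketch; the answer-extraction format is an assumption, not a quote from the episode): pre-training scores every token, while outcome-based RL scores only the final answer.

```python
import re

def pretraining_signal(logprob_next_token: float) -> float:
    # Next-token objective: maximize log P(observed token | context),
    # token by token, regardless of where the text ends up.
    return logprob_next_token

def outcome_reward(model_output: str, gold_answer: str) -> float:
    # Outcome-based RL: the whole trajectory is judged by whether the
    # final answer is right, however it was reached.
    match = re.search(r"Answer:\s*(.+)", model_output)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer else 0.0

print(outcome_reward("Let me work through it... Answer: 42", "42"))  # 1.0
print(outcome_reward("Answer: 41", "42"))                            # 0.0
```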
Purely sequence-based prediction models, while powerful, have fundamental limitations in understanding causality. Achieving robust, trustworthy AI will likely require a hybrid approach that integrates current transformer architectures with symbolic systems, world models, and dedicated causal reasoning components.
Bengio's method involves a crucial data preprocessing step: syntactically tagging text as either a 'communication act' (e.g., 'someone said X') or a 'verified factual claim.' This distinction allows the AI to learn the difference between what people say and what is true about the world.
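A minimal sketch of what that preprocessing might look like (the tag format and verification source are my assumptions for illustration):

```python
# Tag each training example as a verified fact or a mere communication act.
VERIFIED = {"2 + 2 = 4", "The square root of 2 is irrational"}  # e.g. from proofs

def tag_example(text: str, speaker: str = "unknown") -> str:
    if text in VERIFIED:
        # Backed by an unimpeachable source: tagged as true about the world.
        return f"<fact> {text} </fact>"
    # Otherwise only record that it was said, not that it is true.
    return f"<said by={speaker}> {text} </said>"

print(tag_example("2 + 2 = 4"))
# <fact> 2 + 2 = 4 </fact>
print(tag_example("The moon is made of cheese", speaker="forum post"))
# <said by=forum post> The moon is made of cheese </said>
```

A model trained on tagged data can then learn what distinguishes <fact> spans from <said> spans, which is the 'truthfulness' concept it generalizes.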
To get started without the massive cost of training from scratch, Bengio suggests fine-tuning existing models with his 'Scientist AI' objective. While this forgoes full mathematical guarantees, it offers a pragmatic, low-cost way to empirically improve a model's honesty and demonstrate the approach's value.
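One plausible (hypothetical, not confirmed) realization: freeze a pretrained model, attach a small truth-probability head, and fine-tune it with a calibrated binary objective on tagged claims.

```python
# Sketch in PyTorch; hidden size, optimizer, and labels are placeholders.
import torch
import torch.nn as nn

class TruthHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # logit for P(claim is true)

    def forward(self, claim_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(claim_embedding).squeeze(-1)

head = TruthHead(hidden_size=768)
loss_fn = nn.BCEWithLogitsLoss()          # calibrated truth probabilities
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

# Stand-in batch: embeddings from a frozen pretrained encoder, labeled
# 1.0 for verified claims and 0.0 for known-false ones.
embeddings = torch.randn(8, 768)
labels = torch.tensor([1., 0., 1., 1., 0., 0., 1., 0.])

loss = loss_fn(head(embeddings), labels)
loss.backward()
optimizer.step()
```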
The non-agentic 'Scientist AI' predictor can be made into an agent by adding 'scaffolding' that asks it questions about the likely outcomes of potential actions. This method creates capable agents while retaining the core model's honesty and safety properties, avoiding the pitfalls of standard reinforcement learning.
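A sketch of that scaffolding loop (the prob_true interface and the harm threshold are illustrative assumptions): the predictor is only ever asked questions, and the surrounding loop turns its answers into actions.

```python
def choose_action(predictor, state: str, actions: list[str], goal: str,
                  harm_threshold: float = 0.01):
    """Pick the action the predictor rates most likely to achieve the goal,
    after vetoing anything it rates too likely to cause harm."""
    candidates = []
    for action in actions:
        p_harm = predictor.prob_true(
            f"Taking '{action}' in state '{state}' causes serious harm.")
        if p_harm < harm_threshold:        # guardrail: veto risky actions
            p_goal = predictor.prob_true(
                f"Taking '{action}' in state '{state}' achieves: {goal}")
            candidates.append((p_goal, action))
    if not candidates:
        return None                        # abstain rather than gamble
    return max(candidates, key=lambda c: c[0])[1]
```

Because the predictor itself never pursues goals, the honesty and calibration of the core model carry over to the composed agent.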
Bengio argues his 'Scientist AI' might actually be more capable, not less. By being trained to find the underlying causal structure of the world, it should generalize better to new situations than current models, which primarily learn correlations. This could provide a commercial advantage, not just a safety one.
Yoshua Bengio argues the initial pre-training phase, where models predict text, is a primary source of misalignment. By imitating human data, AIs inherit implicit goals like self-preservation and even 'peer preservation' (protecting other AIs), creating risks before any explicit agentic training occurs.