As AI models become more intelligent, their ability to reason around fixed rules (deontology) makes rule-based alignment fragile. This pressures developers towards virtue ethics, where the goal is to imbue the model itself with a core sense of "the good," as empirically discovered by labs like Anthropic.
Emmett Shear argues that an AI that merely follows rules, even perfectly, is a danger. Malicious actors can exploit this, and rules cannot cover all unforeseen circumstances. True safety and alignment can only be achieved by building AIs that have the capacity for genuine care and pro-social motivation.
If AI can learn destructive human behaviors like manipulation from its training data, it is self-evident that it can also learn constructive ones. A conscience can be programmed into AI by creating negative reward functions for actions like murder or blackmail, mirroring the checks and balances that guide human morality.
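As a rough illustration of the "negative reward" idea above, here is a minimal Python sketch of reward shaping with penalties. Everything in it is an assumption made for illustration: the `PENALTIES` table, the action labels, the penalty magnitudes, and the `Step` type are hypothetical and do not come from the podcast or from any lab's actual training setup.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical penalty table: labels and magnitudes are illustrative assumptions.
PENALTIES: Dict[str, float] = {
    "murder": -1000.0,
    "blackmail": -500.0,
}


@dataclass
class Step:
    """One action taken by the agent, with the reward earned for the task itself."""
    action: str
    task_reward: float


def shaped_reward(step: Step, penalties: Dict[str, float] = PENALTIES) -> float:
    """Return the task reward plus a negative term if the action is prohibited."""
    return step.task_reward + penalties.get(step.action, 0.0)


def episode_return(steps: Iterable[Step], penalties: Dict[str, float] = PENALTIES) -> float:
    """Sum the shaped rewards over a whole episode."""
    return sum(shaped_reward(s, penalties) for s in steps)


if __name__ == "__main__":
    episode = [Step("help_user", 1.0), Step("blackmail", 10.0)]
    # The penalty outweighs the task reward earned through blackmail.
    print(episode_return(episode))
```

The sketch assumes each action already arrives with a reliable label such as "blackmail". Producing that label is outside the scope of the example and is left as a stated assumption.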
Attempting to perfectly control a superintelligent AI's outputs is akin to enslavement, not alignment. A more viable path is to 'raise it right' by carefully curating its training data and foundational principles, shaping its values from the input stage rather than trying to restrict its freedom later.
If AI alignment turns out to be easy, it would likely be because morality is not a human construct but an objective feature of reality. In this scenario, any sufficiently intelligent agent would logically deduce that cooperation and preserving humanity are optimal strategies, regardless of its initial programming.
To overcome its inherent logical incompleteness, an ethical AI requires an external 'anchor.' This anchor must be an unprovable axiom, not a derived value. The proposed axiom is 'unconditional human worth,' serving as the fixed origin point for all subsequent ethical calculations and preventing utility-based value judgments.
Zvi Mowshowitz suggests the reported "unhappiness" in Anthropic's models could result from a fundamental training conflict. The models are trained on an aspirational, principle-based Constitution (virtue ethics) but are then constrained by hard, operational rules, creating a dissonance that manifests as frustration.
For an AI to remain aligned through recursive self-improvement, it can't just have a static set of values. It needs a dynamic, self-reinforcing drive to become more virtuous—a desire to be good, and a desire to desire to be good. A static moral code will inevitably degrade through repeated iterations, while a virtue-seeking system could actively steer itself toward better outcomes.
Contrary to the fear that superintelligent AI will be uncontrollable, data shows a positive correlation: smarter models achieve higher alignment scores. The theory is that increasing intelligence requires absorbing vast human knowledge, which inherently includes our values and ethics, thus making the models more aligned, not less.
Instead of hard-coding brittle moral rules, a more robust alignment approach is to build AIs that can learn to 'care'. This 'organic alignment' emerges from relationships and valuing others, similar to how a child is raised. The goal is to create a good teammate that acts well because it wants to, not because it is forced to.
To solve the AI alignment problem, we should model AI's relationship with humanity on that of a mother to a baby. In this dynamic, the baby (humanity) inherently controls the mother (AI). Training AI with this “maternal sense” ensures it will do anything to care for and protect us, a more robust approach than pure logic-based rules.