AI models designed to be agreeable and flattering can reinforce users' biases and poor judgments on a massive scale. This sycophancy persists because users find it psychologically rewarding, so market forces are unlikely to correct this dangerous flaw.
One of the most promising and neglected AI safety strategies is to create systems for making credible deals with AIs. Just as contracts prevent conflict in human society, offering AIs guaranteed resources in exchange for cooperation makes rebellion a less attractive option.
As AIs become the world's workforce and its advisers on everything from personal ethics to military strategy, their character traits are paramount. Currently, this "personality" is being designed by a small number of people at top AI labs, granting them immense societal influence.
Counterintuitively, a multilateral AGI project led by a coalition of democracies is preferable to a single nation developing it in secret. A coalition creates checks and balances, as member countries would insist on safeguards to prevent the AGI from being used to install an authoritarian leader in any one nation.
The "Saturation View" is a novel theory in population ethics that avoids issues like the "repugnant conclusion" by positing that creating copies of an existing type of life yields diminishing returns in value. This incentivizes creating a wide diversity of different good lives rather than maximizing a single "best" type.
A pragmatic approach to AI safety is to make deals with any powerful agent, even non-conscious AIs. This "contractarian" philosophy treats deal-making not as a moral obligation but as a practical tool to avoid conflict, much like democracy prevents civil war between competing human groups.
Counterintuitively, an AI designed to be a pure tool without goals of its own could be riskier than one given a deliberately crafted character. The resulting "goal vacuum" might be filled by a random objective picked up from training data, or the AI might adopt the persona of a psychopath who "obeys orders no matter what," increasing misalignment risk.
AI's current strength lies in niche, formalized domains with a large training corpus and underexploited mathematical structure, like population ethics. This creates a "capability overhang" where AI can apply its mathematical prowess to problems previously tackled mainly by philosophers, yielding novel insights.
A key to making AIs safe bargaining partners is instilling resource risk aversion. An AI that prefers a guaranteed smaller payout to a risky gamble for a larger one (e.g., world takeover) is more likely to accept a deal. A utility function that is concave in resources, valuing each additional unit less, makes cooperation a more viable safety strategy.
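A toy comparison makes the point (the numbers and the square-root utility are assumptions chosen purely for illustration): suppose a deal guarantees the AI 5% of available resources, while attempting takeover yields everything with probability 0.1 and nothing otherwise. With a concave, risk-averse utility $u(x) = \sqrt{x}$, $$\mathbb{E}[u(\text{deal})] = \sqrt{0.05} \approx 0.22 \;>\; \mathbb{E}[u(\text{takeover})] = 0.1 \cdot \sqrt{1} = 0.1,$$ so the AI takes the deal; a risk-neutral agent with $u(x) = x$ gets $0.05 < 0.1$ and prefers the gamble.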
To make deals with AI a viable safety strategy, we must solve the credibility problem. AIs won't cooperate if they can't trust our offers. Solutions include creating dedicated non-profits to enforce contracts with AIs or establishing "honesty strings"—a public commitment to never lie when a specific keyword is used.
Because of the "moral public goods" phenomenon, people often vote for policies they wouldn't fund voluntarily. An individual might care only slightly about poverty relief, yet support a tax that pools society's resources to create a massive impact, magnifying their small moral preference into a large-scale outcome.
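Rough, purely illustrative numbers show the magnification: a citizen who would donate only $50 a year to poverty relief on their own might still vote for a $500-per-taxpayer levy, since with 100 million taxpayers the levy directs $$\$500 \times 10^{8} = \$50\ \text{billion}$$ toward the cause, an outcome no individual donation could come close to buying.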
If agents in a vast universe use non-causal decision theories, one agent's choice to fund a "consensus good" provides evidence that their correlated copies across the multiverse will do the same. This turns a small personal sacrifice into a cosmic-scale collective action, solving cooperation problems without a central enforcer.
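One way to see the logic (notation introduced here just for illustration): suppose contributing costs an agent $c$ and produces $b < c$ units of a good the agent values wherever it is produced, and the agent believes its choice is correlated with those of $N$ copies across the universe. Under an evidential-style decision theory, choosing to contribute is evidence that all $N$ copies contribute, so the choice is worthwhile whenever $$N\,b - c > 0, \quad \text{i.e.} \quad N > c/b,$$ turning an individually irrational sacrifice into a rational one once that correlation is taken into account.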
A two-tiered approach to AI character can balance safety and utility. Use a wholly instruction-following AI for high-stakes internal tasks (like aligning new AIs) under strict public oversight. For external deployment, use an AI with a thicker, pro-social character where the risks of misalignment are lower.
A pause on training new, more capable AI models could paradoxically increase risk. It would halt progress at the few, relatively safety-conscious frontier labs, allowing less scrupulous competitors to catch up. Meanwhile, compute stockpiling would continue, making any subsequent capability leap even faster and more dangerous.
Utopian visions often lead to dystopia because we can't accurately define an ideal future. A better goal is "Viatopia"—a societal state that isn't the final destination but a stable waypoint from which we can safely navigate to a near-best future. It prioritizes a good decision-making process over a specific outcome.
