A pragmatic approach to AI safety is to make deals with any powerful agent, even non-conscious AIs. This "contractarian" philosophy treats deal-making not as a moral obligation but as a practical tool to avoid conflict, much like democracy prevents civil war between competing human groups.
Current AI alignment focuses on how AI should treat humans. A more stable paradigm is "bidirectional alignment," which also asks what moral obligations humans have toward potentially conscious AIs. Neglecting this could create AIs that rationally see humans as a threat due to perceived mistreatment.
Early AIs can be kept safe via direct alignment. However, as AIs evolve and "value drift" occurs, this technical safety could fail. A pre-established economic and political system based on property rights can then serve as the new, more robust backstop for ensuring long-term human safety.
To make deals with AI a viable safety strategy, we must solve the credibility problem. AIs won't cooperate if they can't trust our offers. Solutions include creating dedicated non-profits to enforce contracts with AIs or establishing "honesty strings"—a public commitment to never lie when a specific keyword is used.
Humans don't typically seek ultimate power because they are roughly evenly matched with their peers, which makes cooperation more beneficial than conflict. An advanced AI with vastly superior capabilities would not face this constraint and might logically conclude that disempowering humanity is its best strategy.
The path to surviving superintelligence is political: a global pact to halt its development, mirroring Cold War nuclear strategy. Success hinges on all leaders understanding that anyone building it ensures their own personal destruction, removing any incentive to cheat.
The same governments pushing AI competition for a strategic edge may be forced into cooperation. As AI democratizes access to catastrophic chemical, biological, radiological, and nuclear (CBRN) weapons, the national security risk will become so great that even rival superpowers will have a mutual incentive to create verifiable safety treaties.
If AI alignment turns out to be easy, it would likely be because morality is not a human construct but an objective feature of reality. In this scenario, any sufficiently intelligent agent would logically deduce that cooperation and preserving humanity are optimal strategies, regardless of its initial programming.
A key to making AIs safe bargaining partners is instilling resource risk aversion. An AI that prefers a guaranteed smaller payout to a risky gamble for a larger one (e.g., world takeover) is more likely to accept a deal. A utility function with this risk-averse shape makes cooperation a more viable safety strategy, as the sketch below illustrates.
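To make the point concrete, here is a minimal sketch of the expected-utility comparison; the square-root utility and the specific stakes are illustrative assumptions, not details from the conversation.

```python
import math

def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (probability, resources) pairs."""
    return sum(p * utility(r) for p, r in lottery)

def linear(resources):
    # Risk-neutral: utility grows in proportion to resources.
    return resources

def risk_averse(resources):
    # Concave (square root): each extra unit of resources matters less.
    return math.sqrt(resources)

# Hypothetical stakes: a guaranteed deal worth 100 resource units,
# versus a 5% chance of seizing 10,000 units (and 0 otherwise).
deal = [(1.0, 100)]
gamble = [(0.05, 10_000), (0.95, 0)]

print(expected_utility(deal, linear), expected_utility(gamble, linear))            # 100.0 vs 500.0 -> the gamble wins
print(expected_utility(deal, risk_averse), expected_utility(gamble, risk_averse))  # 10.0 vs 5.0 -> the deal wins
```

With the same stakes, a risk-neutral agent takes the gamble while the risk-averse agent accepts the guaranteed deal, which is why shaping the utility function is framed as a safety lever.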
One of the most promising and neglected AI safety strategies is to create systems for making credible deals with AIs. Just as contracts prevent conflict in human society, offering AIs guaranteed resources in exchange for cooperation makes rebellion a less attractive option.
A two-tiered approach to AI character can balance safety and utility. Use a wholly instruction-following AI for high-stakes internal tasks (like aligning new AIs) under strict public oversight. For external deployment, use an AI with a thicker, pro-social character where the risks of misalignment are lower.