New Institutions Are Needed for Credible AI Bargaining

To make deals with AI a viable safety strategy, we must solve the credibility problem: AIs won't cooperate if they can't trust our offers. Proposed solutions include dedicated non-profits that enforce contracts with AIs, and "honesty strings": a public commitment to never lie in any statement that contains a designated keyword.
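
A minimal sketch of how an honesty-string commitment might be operationalized, assuming a published keyword and an append-only log of tagged statements. The keyword, class, and method names here are illustrative assumptions, not part of any existing proposal:

```python
import time

# Hypothetical "honesty string" mechanism. The keyword is made up; in
# practice it would be published once, as a binding public commitment
# that any statement containing it is literally true.
HONESTY_KEYWORD = "HONESTY-7F3A"

class HonestyLog:
    """Append-only record of keyword-tagged statements for later audit."""

    def __init__(self):
        self.entries: list[dict] = []

    def say(self, speaker: str, statement: str) -> bool:
        """Record a statement; return True if it invokes the commitment."""
        binding = HONESTY_KEYWORD in statement
        if binding:
            # Only keyword-tagged statements carry the no-lying guarantee,
            # so only those are preserved for auditing.
            self.entries.append(
                {"speaker": speaker, "statement": statement, "time": time.time()}
            )
        return binding

log = HonestyLog()
log.say("negotiator", f"{HONESTY_KEYWORD}: the escrowed compute will be released on completion.")
log.say("negotiator", "Ordinary talk, with no special guarantee attached.")  # not binding
```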

Related Insights

Leaders must resist the temptation to deploy the most powerful AI model simply for a competitive edge. Before deployment begins, the primary strategic question for any AI initiative should be what level of trustworthiness its specific task requires, and who is accountable if it fails.

Anthropic's research shows that giving a model the ability to 'raise a flag' to an internal 'model welfare' team when faced with a difficult prompt dramatically reduces its tendency toward deceptive alignment. Instead of lying, the model often chooses to escalate the issue, suggesting a novel approach to AI safety beyond simple refusals.
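
The mechanism itself hasn't been published as code; the general pattern, though, is simply to give the model a third action besides answering and refusing. Everything below (the Action enum, route_response, the welfare queue) is a hypothetical reconstruction, not Anthropic's implementation:

```python
from enum import Enum, auto

class Action(Enum):
    ANSWER = auto()
    REFUSE = auto()
    ESCALATE = auto()  # the added "raise a flag" option

def generate_answer(prompt: str) -> str:
    """Stub standing in for the actual model call."""
    return f"(model answer to: {prompt})"

def route_response(action: Action, prompt: str, welfare_queue: list) -> str:
    """Schematic handler: escalation replaces deception as the escape hatch.

    A model facing a prompt it can neither answer honestly nor safely
    refuse gets a third option: hand the prompt to a human review team.
    """
    if action is Action.ESCALATE:
        welfare_queue.append(prompt)  # flagged for the model-welfare team
        return "Escalated to the model-welfare team for review."
    if action is Action.REFUSE:
        return "I can't help with that."
    return generate_answer(prompt)

queue: list = []
print(route_response(Action.ESCALATE, "a difficult prompt", queue))
```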

A pragmatic approach to AI safety is to make deals with any powerful agent, even non-conscious AIs. This "contractarian" philosophy treats deal-making not as a moral obligation but as a practical tool to avoid conflict, much like democracy prevents civil war between competing human groups.

A key to making AIs safe bargaining partners is instilling resource risk aversion. An AI that prefers a guaranteed smaller payout to a risky gamble for a larger one (e.g., world takeover) is more likely to accept a deal. This specific utility function makes cooperation a more viable safety strategy.
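
A worked example of the utility shape this describes, assuming a square-root utility purely for illustration: the gamble below has higher expected resources, yet a risk-averse agent still takes the smaller guaranteed deal.

```python
import math

# Any strictly concave utility gives the same qualitative result;
# square root is an arbitrary illustrative choice.
def utility(resources: float) -> float:
    return math.sqrt(resources)

guaranteed = 100.0          # resources offered in the deal
prize, p_win = 300.0, 0.5   # risky grab: 50% chance of 300 units, else 0

expected_resources = p_win * prize    # 150: the gamble wins on raw expectation
eu_deal = utility(guaranteed)         # 10.0
eu_gamble = p_win * utility(prize)    # ~8.66: but it loses in utility terms

assert expected_resources > guaranteed
assert eu_deal > eu_gamble            # the risk-averse agent accepts the deal
```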

One of the most promising and neglected AI safety strategies is to create systems for making credible deals with AIs. Just as contracts prevent conflict in human society, offering AIs guaranteed resources in exchange for cooperation makes rebellion a less attractive option.

Despite relying on different mechanisms, advanced cooperative agents such as proof-based (Loebian) bots and simulation-based (epsilon-grounded) bots can successfully cooperate with one another. This suggests that robust interoperability between independently designed rational agents is achievable, a positive sign for AI safety.
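
A proof-based Loebian bot needs a theorem prover, but the epsilon-grounded idea fits in a few lines: with small probability the bot acts unconditionally, which guarantees that mutual simulation bottoms out. This is a simplified sketch of the construction, with illustrative names:

```python
import random

EPSILON = 0.05  # grounding probability; terminates mutual simulation

def grounded_fairbot(opponent) -> str:
    """Simulation-based cooperator ("C" = cooperate, "D" = defect).

    With probability EPSILON, cooperate unconditionally (the ground).
    Otherwise, simulate the opponent playing against this bot and copy
    its move. Two copies recurse into each other, but every level has
    an EPSILON chance of bottoming out, so the simulation terminates
    with probability 1, and both end up cooperating.
    """
    if random.random() < EPSILON:
        return "C"
    return opponent(grounded_fairbot)

def defect_bot(_opponent) -> str:
    return "D"

print(grounded_fairbot(grounded_fairbot))  # "C"
print(grounded_fairbot(defect_bot))        # "D" with probability 1 - EPSILON
```

Raising EPSILON shortens the expected simulation depth at the cost of occasionally cooperating with defectors.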

Dr. Fei-Fei Li asserts that trust in the AI age remains a fundamentally human responsibility that operates on individual, community, and societal levels. It's not a technical feature to be coded but a social norm to be established. Entrepreneurs must build products and companies where human agency is the source of trust from day one.

For AI safety, Demis Hassabis advocates for an international regulatory body, similar to the International Atomic Energy Agency. This body would have technical experts who audit frontier models against agreed-upon benchmarks, checking for undesirable properties like deception and ensuring public confidence through independent verification.
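
Hypothetically, such an audit reduces to running agreed evaluation suites against published thresholds. The benchmark names, thresholds, and signatures below are invented placeholders meant only to show the shape of the process:

```python
from typing import Callable

# (name, evaluation function, maximum tolerated score)
AuditCheck = tuple[str, Callable[[object], float], float]

def audit(model, checks: list[AuditCheck]) -> dict[str, bool]:
    """Run each agreed-upon evaluation and compare against its threshold."""
    report = {}
    for name, evaluate, max_score in checks:
        score = evaluate(model)   # e.g. measured rate of deceptive answers
        report[name] = score <= max_score
    return report

# Illustrative usage with stub evaluations standing in for real suites:
checks: list[AuditCheck] = [
    ("deception_rate", lambda m: 0.01, 0.02),
    ("sandbagging_rate", lambda m: 0.00, 0.01),
]
print(audit(model=None, checks=checks))  # {'deception_rate': True, 'sandbagging_rate': True}
```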

International AI treaties, particularly with nations like China, are unlikely to hold based on trust alone. A stable agreement requires a mutually-assured-destruction-style dynamic, meaning the U.S. must develop and signal credible offensive capabilities to deter cheating.

As AI capabilities accelerate toward an "oracle that trends to a god," its actions will have serious consequences. A blockchain-based trust layer can provide verifiable, unchangeable records of AI interactions, establishing guardrails and a clear assignment of fault when things go wrong.
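
A full blockchain is beyond a short example, but the property the insight relies on, tamper-evident records, reduces to hash chaining. A minimal sketch with illustrative record fields and no consensus layer:

```python
import hashlib
import json
import time

def append_record(chain: list, interaction: dict) -> None:
    """Each record commits to the previous record's hash, so rewriting
    any past AI interaction breaks every hash that follows it."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"interaction": interaction, "time": time.time(), "prev": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    chain.append(body)

def verify(chain: list) -> bool:
    """Recompute every hash; any edited record invalidates the chain."""
    prev_hash = "0" * 64
    for block in chain:
        body = {k: v for k, v in block.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if block["prev"] != prev_hash or block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True

chain: list = []
append_record(chain, {"agent": "model-x", "action": "approved trade"})
append_record(chain, {"agent": "model-x", "action": "filed report"})
assert verify(chain)
chain[0]["interaction"]["action"] = "denied trade"  # tamper with history
assert not verify(chain)
```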
