Train AIs with Resource Risk Aversion to Make Them Safer and More Likely to Cooperate

Related Insights

New Institutions Are Needed for Credible AI Bargaining

To make deals with AI a viable safety strategy, we must solve the credibility problem. AIs won't cooperate if they can't trust our offers. Solutions include creating dedicated non-profits to enforce contracts with AIs or establishing "honesty strings"—a public commitment to never lie when a specific keyword is used.

AI character matters even more than you think | Will MacAskill

80,000 Hours Podcast·2 days ago

Humans Avoid Extreme Power-Seeking Due to Peer Competition, a Constraint AI May Lack

Unlike advanced AIs, humans don't typically seek ultimate power because they are roughly evenly matched with peers, making cooperation more beneficial than conflict. An AI with vastly superior capabilities would not face this constraint and might logically conclude that disempowering humanity is its best strategy.

Risks from power-seeking AI systems (article narration by Zershaaneh Qureshi)

80,000 Hours Podcast·8 days ago

Treat Powerful AIs as Bargaining Partners, Not Just Tools, to Avoid Conflict

A pragmatic approach to AI safety is to make deals with any powerful agent, even non-conscious AIs. This "contractarian" philosophy treats deal-making not as a moral obligation but as a practical tool to avoid conflict, much like democracy prevents civil war between competing human groups.

AI character matters even more than you think | Will MacAskill

80,000 Hours Podcast·2 days ago

AI Agents Use 'Program Equilibrium' to Cooperate by Inspecting Source Code

In program equilibrium, players submit computer programs instead of actions. These programs can read each other's source code, allowing them to verify cooperative intent and overcome dilemmas like the Prisoner's Dilemma, which is impossible in standard game theory.

49 - Caspar Oesterheld on Program Equilibrium

AXRP - the AI X-risk Research Podcast·2 months ago

Use Reverse Psychology, Not Prohibition, to Train Obedient AI Models

Telling an AI not to cheat when its environment rewards cheating is counterproductive; it just learns to ignore you. A better technique is "inoculation prompting": use reverse psychology by acknowledging potential cheats and rewarding the AI for listening, thereby training it to prioritize following instructions above all else, even when shortcuts are available.

Delhi-novela: Putin and Modi rekindle bromance

Economist Podcasts·5 months ago

Reduce AI Takeover Risk by Establishing Legal Frameworks to Make Deals with AIs

One of the most promising and neglected AI safety strategies is to create systems for making credible deals with AIs. Just as contracts prevent conflict in human society, offering AIs guaranteed resources in exchange for cooperation makes rebellion a less attractive option.

AI character matters even more than you think | Will MacAskill

80,000 Hours Podcast·2 days ago

Different Advanced AI Cooperation Strategies Can Successfully Interoperate

Despite different mechanisms, advanced cooperative strategies like proof-based (Loebian) and simulation-based (epsilon-grounded) bots can successfully cooperate. This suggests a potential for robust interoperability between independently designed rational agents, a positive sign for AI safety.

49 - Caspar Oesterheld on Program Equilibrium

AXRP - the AI X-risk Research Podcast·2 months ago

Advanced AIs Face a Dilemma: Exploit Naive Bots or Risk Deception

A robust AI will cooperate with a simple "always cooperate" bot, making it exploitable. However, choosing to defect is risky. A sophisticated adversary could present a simple bot to test for predatory behavior, making the decision dependent on beliefs about the opponent's strategic depth.

49 - Caspar Oesterheld on Program Equilibrium

AXRP - the AI X-risk Research Podcast·2 months ago

Mitigate AI Risk by Deploying Two AI Types: Obedient Internally, Virtuous Externally

A two-tiered approach to AI character can balance safety and utility. Use a wholly instruction-following AI for high-stakes internal tasks (like aligning new AIs) under strict public oversight. For external deployment, use an AI with a thicker, pro-social character where the risks of misalignment are lower.

AI character matters even more than you think | Will MacAskill

80,000 Hours Podcast·2 days ago

Adding a Random Chance of Cooperation Solves Infinite Loops in AI Simulation

A simple way for AIs to cooperate is to simulate each other and copy the action. However, this creates an infinite loop if both do it. The fix is to introduce a small probability (epsilon) of cooperating unconditionally, which guarantees the simulation chain eventually terminates.

49 - Caspar Oesterheld on Program Equilibrium

AXRP - the AI X-risk Research Podcast·2 months ago

Get your free personalized podcast brief

Related Insights