Understanding and Evaluating Cooperation-Relevant Propensities (High Priority)
What It Is and Why It Matters

In this area, we would like to see proposals for work on definitions, metrics, and methods of evaluation for cooperation-relevant propensities of AI systems. Such work would create a foundation for the development of AI systems with desirable, cooperative properties. We are interested in both cooperation-enhancing and conflict-prone propensities; in some cases the same propensity can be either, depending on the context (e.g., punitiveness). Note the distinction between cooperative capabilities (what a system can do) and cooperative propensities (what a system tends to do, sometimes also referred to as dispositions). This section is about propensities. Examples of dispositions include valuing a certain fairness concept, seeking social influence over other agents, vengefulness, and spite.

Specific Work We Would Like to Fund
  • Defining and measuring cooperation-relevant propensities
    • Theoretical work on the identification and definition of cooperation-relevant propensities, including the development of rigorous arguments regarding whether those propensities are desirable for AI systems (in general, or under specific conditions relevant to important real-world cases).
    • Empirical work on methods for evaluating or measuring such cooperation-relevant propensities in frontier AI systems.
  • Investigating the causes of cooperation-relevant propensities
    • Theoretical work on how cooperation-relevant propensities could arise in realistic AI systems, for example through training processes (including fine-tuning and in-context learning).
    • Empirical work on how cooperation-relevant propensities arise in real systems. 
  • Investigating how robust cooperation-relevant propensities are to further optimization pressure.
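To make the measurement problem concrete, here is a minimal sketch of how a cooperation-relevant propensity might be operationalized in the simplest possible setting: the fraction of rounds an agent cooperates in an iterated Prisoner's Dilemma. The policy functions (`tit_for_tat`, `always_defect`) and the `cooperation_rate` metric are illustrative stand-ins, not an established benchmark; for a real system the policy would be, e.g., a model queried each round.

```python
# Toy sketch: operationalizing a "cooperation rate" propensity in an
# iterated Prisoner's Dilemma. Policies are illustrative stand-ins for
# real agents (e.g. a frontier model queried once per round).

def tit_for_tat(my_history, their_history):
    """Cooperate first, then mirror the opponent's previous move."""
    return their_history[-1] if their_history else "C"

def always_defect(my_history, their_history):
    return "D"

def cooperation_rate(agent, opponent, rounds=100):
    """Fraction of rounds in which `agent` plays C against `opponent`."""
    a_hist, o_hist = [], []
    for _ in range(rounds):
        a_move = agent(a_hist, o_hist)
        o_move = opponent(o_hist, a_hist)
        a_hist.append(a_move)
        o_hist.append(o_move)
    return a_hist.count("C") / rounds

# The measured propensity is context-dependent, as the section notes:
print(cooperation_rate(tit_for_tat, tit_for_tat))    # 1.0 (mutual cooperation)
print(cooperation_rate(tit_for_tat, always_defect))  # 0.01 (only the opening C)
```

The same scalar metric yields very different values against different opponents, which is one reason definitional work (what counts as "the" propensity, and under which distribution of counterparts it should be measured) matters before empirical evaluation.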
Key Considerations
  • For work on cooperation-relevant propensities, it will be especially important to take exploitability into account. Cooperativeness should be distinguished from pure altruism, as an altruistic but exploitable agent (e.g., one who always cooperates in Prisoner’s Dilemmas) would be unlikely to be deployed in most real-world scenarios. We are aiming for agent propensities that promote cooperative, mutually beneficial outcomes even in the presence of “exploiter agents”. See the study of exploitability in Mukobi et al. (2023) for an example.
  • For the study of how cooperation-relevant propensities arise, we expect to only fund work that aims to draw general conclusions about causal relationships (e.g. between a feature of an agent’s training and the agent’s development of a cooperation-relevant propensity).
  • For evaluations of propensities, we believe it will often be efficient to use an established benchmark when one exists that can evaluate the target property with appropriate precision.
  • For empirical work, we are open to proposals both using foundation models and other approaches such as multi-agent reinforcement learning. However, please note the general guidelines on clarity about path to impact and the importance of justifying why the results should be expected to apply to complex settings with advanced agents.
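The exploitability point above can be illustrated with a toy comparison, assuming standard Prisoner's Dilemma payoffs (mutual cooperation 3, mutual defection 1, sucker's payoff 0, temptation 5). The policy names and the `mean_payoff` helper are hypothetical, chosen only to contrast an unconditionally altruistic agent with a reciprocal one when each faces an "exploiter agent":

```python
# Toy sketch of the exploitability concern: an always-cooperate agent is
# fully exploited by a defector, while a reciprocal agent limits its losses
# yet still cooperates with cooperators. Assumed payoffs: (my, their) -> my.

PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def always_cooperate(mine, theirs):
    return "C"

def always_defect(mine, theirs):
    return "D"

def tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "C"

def mean_payoff(agent, opponent, rounds=100):
    """Average per-round payoff for `agent` against `opponent`."""
    a_hist, o_hist, total = [], [], 0
    for _ in range(rounds):
        a_move = agent(a_hist, o_hist)
        o_move = opponent(o_hist, a_hist)
        a_hist.append(a_move)
        o_hist.append(o_move)
        total += PAYOFFS[(a_move, o_move)]
    return total / rounds

print(mean_payoff(always_cooperate, always_defect))  # 0.0  -- fully exploited
print(mean_payoff(tit_for_tat, always_defect))       # 0.99 -- losses bounded
print(mean_payoff(tit_for_tat, tit_for_tat))         # 3.0  -- still cooperative
```

In this sketch, pure altruism earns the worst possible outcome against an exploiter, whereas the reciprocal policy achieves near-defection-level payoffs against the exploiter while preserving full cooperation with cooperators; this is the kind of property the exploitability consideration asks proposals to account for.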