For several of our priority areas, we believe that work on benchmarks and evaluations will be among the most important contributions we could fund. In this section, we want to highlight some particular considerations for projects that involve benchmarks or evaluations.
- Later in 2025 we plan to organise a community-driven suite of cooperative AI benchmarks and evaluations. This means that we are unlikely to fund proposals that heavily overlap with this future project, but if you are interested in being involved or have suggestions regarding this initiative, please contact us.
- It is crucial that metrics are well-justified and theoretically principled.
- It is important that the environment has an appropriate level of complexity in relation to the properties being evaluated. The relevance and applicability of the approach to state-of-the-art agents should always be discussed.
- Proposals for work on benchmarks should justify the choice either to extend and improve an existing benchmark or to develop a new one from scratch. We believe it will often be more efficient to build on an established benchmark when one exists that can evaluate the target property with appropriate precision. When proposing to develop a new benchmark from scratch, it is important to clarify how it would constitute a significant advance over existing benchmarks.
- For evaluations of agents, it is important that exploitability and robustness to different kinds of co-players are carefully considered.
- Proposals with benchmark or evaluation components should include as much detail as possible on environments, datasets and co-players. For example, it must be clear whether agents are evaluated in self-play or against a distribution of co-players, and in the latter case, details of the co-player distribution should be included (see the first sketch after this list).
- For some projects, it may be relevant to develop agents with uncooperative properties and study the efficacy of interventions and/or the effectiveness of evaluations on such agents. This type of work should focus on properties that could plausibly arise from realistic training processes.
- Proposals for benchmarks and evaluations should explicitly justify why the authors expect that AI progress will not quickly saturate the performance metrics. This might mean designing the benchmark/evaluation to incorporate a sliding difficulty scale, which can be used to rigorously test and compare agents even as their overall capabilities rapidly advance (see the second sketch after this list).
- We are particularly interested in funding work on benchmarks and evaluations targeting the cooperative properties of LLMs and LLM-based agents. However, we think it can be relevant to develop multi-agent reinforcement learning (MARL) benchmarks and/or to test evaluation methods in MARL under certain circumstances:
- If the benchmark is highly novel and complex, and it makes sense to start with MARL before considering foundation model agents; or
- If the benchmark accurately captures dynamics that we would also expect to play out at a larger scale or with different learning paradigms; or
- If the benchmark is directly relevant to an important setting that uses RL agents, and where those agents are unlikely to be replaced.
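To make the co-player point above concrete, the following is a minimal Python sketch of the difference between self-play evaluation and evaluation against an explicit co-player distribution. The `env.run_episode` interface and the agent objects are assumptions for illustration only; the relevant part for proposals is that the population and weights passed to the second function are precisely the details we would expect to see documented.

```python
import random
from statistics import mean

def evaluate_self_play(agent, env, episodes=100):
    """Evaluate an agent paired only with copies of itself."""
    return mean(env.run_episode([agent, agent]) for _ in range(episodes))

def evaluate_against_population(agent, env, co_players, weights, episodes=100):
    """Evaluate an agent against co-players drawn from an explicit,
    documented distribution (here: a weighted list of fixed policies)."""
    scores = []
    for _ in range(episodes):
        partner = random.choices(co_players, weights=weights, k=1)[0]
        scores.append(env.run_episode([agent, partner]))
    return mean(scores)
```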
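On the point about avoiding rapid saturation, one possible way to build in a sliding difficulty scale is to parameterise the environment so that harder instances can be generated on demand. The sketch below is purely illustrative: the specific knobs (number of co-players, observation noise, horizon) are assumptions, and any real choice of difficulty parameters would need to be justified in relation to the property being evaluated.

```python
from dataclasses import dataclass

@dataclass
class DifficultyConfig:
    """Hypothetical knobs that scale task difficulty without changing
    the property being measured."""
    num_co_players: int   # more co-players -> harder coordination
    noise_level: float    # noisier observations or communication channels
    horizon: int          # longer horizons -> harder credit assignment

def make_difficulty_ladder(levels=5):
    """Generate a ladder of increasingly hard configurations, so the same
    benchmark can keep discriminating between agents as capabilities advance."""
    return [
        DifficultyConfig(
            num_co_players=2 + level,
            noise_level=0.05 * level,
            horizon=50 * (level + 1),
        )
        for level in range(levels)
    ]
```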