Many real-world situations exhibit mixed cooperative-competitive incentives. From mixed-autonomy traffic to social network moderation and climate change negotiations, games with conflicting agent interests are ubiquitous. This places such general-sum settings at the core of the Cooperative AI research agenda (Dafoe et al. 2021). However, machine learning in general-sum games is notoriously unstable. Instead of converging to outcomes that favour prosocial equilibria, naive learning (NL) algorithms tend to settle on inferior, Pareto-dominated solutions. For example, as further detailed below, NL agents tend to learn to mutually defect in many social games even though mutual cooperation would achieve higher overall welfare. In a world where artificial agents are continuously adapting to changing environments, and thus learning in an online fashion while interacting with each other – or with humans – this clearly puts society at a loss when prosocial solutions are desired.
Even seemingly simple social dilemmas, such as the famous Iterated Prisoner’s Dilemma (IPD), have posed a notorious challenge to multi-agent learning algorithms seeking to converge to prosocial outcomes. In the IPD, two agents play a repeated single-step matrix game (shown below) without knowing when the game will stop. In each round, both agents can either “defect” (sometimes also called “betray”) or “cooperate”. If both agents cooperate, they each serve one year in prison. If both agents defect, they each serve two years in prison. If the agents choose differently, the defector is freed immediately, while the other goes to prison for three years.
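One common way to encode this stage game numerically is to treat rewards as negative years in prison, so that higher is better (a minimal sketch; the encoding below is our own, not taken from the paper):

```python
import numpy as np

# PAYOFF[a_self, a_other] is the reward for playing a_self against a_other,
# with 0 = cooperate and 1 = defect; the game is symmetric.
PAYOFF = np.array([[-1.0, -3.0],   # cooperate: -1 vs a cooperator, -3 vs a defector
                   [ 0.0, -2.0]])  # defect:     0 vs a cooperator, -2 vs a defector
```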
Assuming life inside prison is worse than life outside, both agents cooperating at all times maximises overall welfare in the IPD, as it minimises the total time spent in prison. Unfortunately, since each player has an individual incentive to defect, simple game-theoretic analysis shows that if the game is repeated a finite and known number of times, there is only one equilibrium, in which both players consistently defect. However, if players expect to engage with the same co-player in the future, multiple equilibria exist, including ones where both players always choose to cooperate. A (now famous) series of tournaments organised by Axelrod (Axelrod and Hamilton 1981) further suggests that agents playing a Tit-For-Tat (TFT) strategy are remarkably robust against a wide range of human-designed IPD strategies. The TFT strategy begins by cooperating and from then on copies the co-player’s previous move.
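For concreteness, here is a minimal sketch of TFT, using the same action encoding as above (0 = cooperate, 1 = defect); the function name and history format are our own illustrative choices:

```python
def tit_for_tat(history):
    """history is a list of (my_action, co_player_action) pairs from earlier rounds."""
    if not history:
        return 0               # cooperate on the first move
    return history[-1][1]      # then copy the co-player's previous move
```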
While humans might naturally design TFT strategies for the IPD, historically machine learning researchers have struggled to devise algorithms that converge to anything but mutual defection. Opponent shaping (OS) techniques (Foerster et al. 2018) are a family of algorithmic approaches where an agent actively tries to nudge its co-players' learning process toward preferred equilibria. This has enabled the development of algorithms that learn prosocial equilibria in a variety of toy general-sum games.
In this blog post, we first give an overview of recent developments in OS research before introducing our novel learning algorithm Model-Free Opponent Shaping (M-FOS). M-FOS does not require a differentiable model of the opponent, making it more versatile and robust, thereby improving on existing OS techniques. At the same time, we show that M-FOS not only succeeds at shaping opponents towards cooperation but, in fact, frequently ends up exploiting them. As such, despite certain unresolved scalability issues, M-FOS presents itself as a good model of future agents that are increasingly capable but not necessarily maximally prosocial. We conclude with a discussion of its impact, and of future work.
Intuitively, an intelligent agent should take the effect of its policy on the learning of co-players into account. Indeed this simple but powerful idea of Opponent Shaping (OS) has proven a fruitful avenue in developing novel multi-agent learning algorithms that allow agents in general-sum games to achieve their own goals through cooperative means. Note that, instead of “opponent”, one may want to think instead of the less antagonistic term “co-player”.
It may seem surprising that OS frequently gives rise to cooperation. Unlike in consensus optimisation (Lamport 1998), where the agents’ joint objective is to reach consensus, OS agents merely optimise their own goals, rather than some mutually agreed compromise objective. Nonetheless, OS excels at minimising exploitability while preferring mutually beneficial outcomes in toy settings. In the following, we provide a brief history of OS techniques.
In Learning with Opponent-Learning Awareness (LOLA), Foerster et al. (2018) augment the learning objective such that the agent differentiates through its opponent’s learning step, meaning that the updates to each LOLA agent’s strategy take into account the updates made by other agents. LOLA was one of the first general-sum learning methods to find the TFT strategy in the IPD.
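Up to notation (our symbols: policy parameters θ₁, θ₂; value functions V¹, V²; learning rates α, η), the first-order LOLA update for player 1 takes roughly the following form:

$$
\Delta\theta_1 \;=\; \alpha\,\nabla_{\theta_1} V^1(\theta_1,\theta_2) \;+\; \alpha\eta\,\big(\nabla_{\theta_2} V^1(\theta_1,\theta_2)\big)^{\top}\,\nabla_{\theta_1}\nabla_{\theta_2} V^2(\theta_1,\theta_2),
$$

where the second term anticipates and differentiates through the opponent’s naive learning step Δθ₂ = η∇_{θ₂}V²(θ₁, θ₂); a naive learner would keep only the first term.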
While elegant, LOLA does not provide any theoretical convergence guarantees. Indeed, Letcher et al. (2019) soon identified games, most notably the “Tandem” game, in which LOLA converges to unstable fixed points that are sub-optimal for both players. Stable Opponent Shaping (SOS) resolves this issue by combining LOLA with Lookahead (Zhang and Lesser 2010), a precursor to OS. The authors prove that SOS locally converges to stable fixed points and avoids strict saddles in all differentiable games, while retaining LOLA’s ability to take opponent dynamics into account.
LOLA and SOS are fundamentally inconsistent, however: they implicitly assume that co-players are NL agents. What if the co-player is also a LOLA or SOS agent? To remedy this shortcoming, Willi et al. (2022) introduce Higher-Order LOLA (HOLA), which can reliably shape other LOLA (or HOLA) agents. Unfortunately, HOLA requires computing a recursive update function, and does not always converge. Consistent LOLA (COLA) avoids this explicit computation by directly learning the update steps. Finally, HOLA and COLA, like their precursors, only consider one-step shaping, rather than long horizons.
Whether LOLA, SOS, HOLA, or COLA, all of the OS methods introduced above share one crucial shortcoming: they require access to a differentiable model of the opponent’s policy. Simply learning such a model from interaction feedback may often not be computationally feasible, which calls into question the applicability of existing OS approaches to real-world settings. Even if a differentiable model of the opponent policy is available, estimating the higher-order derivatives required for OS is costly (Finn et al. 2017), and differentiating through multiple learning steps at a time is neither computationally stable nor tractable.
Model-Free Opponent Shaping (M-FOS) addresses this issue by formulating OS as a meta-game, in which each meta-step is an episode of the underlying game. Using gradient-free evolutionary algorithms for the outer-loop optimisation step, M-FOS merely requires black-box access to the opponent policy. This increases its applicability to real-world settings, in which white-box access to co-player policies (or even knowledge of the co-players’ learning methods) is not necessarily guaranteed. In addition, unlike any of its predecessors, M-FOS is perfectly capable of shaping its opponents across many of their learning steps. The conceptual simplicity of M-FOS for a given player (i) can be seen in the following algorithm. Note that at each meta-step, M-FOS outputs the weights of a policy network (in line 8) that defines the agent policy playing against the opponent (-i) in the next inner-loop episode.
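Separately from the paper’s own algorithm listing, the sketch below gives a minimal, illustrative rendering of this meta-game structure (not the authors’ implementation): a tiny meta-policy plays a one-shot Prisoner’s Dilemma against a naive gradient learner, and the outer loop is optimised with a simple evolution strategy. The meta-policy parameterisation, hyperparameters, and the choice to condition on the opponent’s cooperation probability are all illustrative assumptions.

```python
# Minimal, illustrative M-FOS-style meta-game sketch (our own toy construction).
import numpy as np

# Row player's rewards (negative years in prison); 0 = cooperate, 1 = defect.
PAYOFF = np.array([[-1.0, -3.0],
                   [ 0.0, -2.0]])

def expected_return(p_coop_self, p_coop_other):
    """Expected reward of the first player given both cooperation probabilities."""
    p_self = np.array([p_coop_self, 1.0 - p_coop_self])
    p_other = np.array([p_coop_other, 1.0 - p_coop_other])
    return p_self @ PAYOFF @ p_other

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def meta_episode(meta_params, n_meta_steps=50, opp_lr=1.0, eps=1e-4):
    """One meta-episode: at each meta-step the meta-agent observes (but never
    differentiates through) the naive learner's current policy and outputs its
    own inner-loop policy; the naive learner then takes one gradient step."""
    w, b = meta_params
    opp_logit = 0.0
    total = 0.0
    for _ in range(n_meta_steps):
        p_opp = sigmoid(opp_logit)
        p_me = sigmoid(w * p_opp + b)          # meta-agent's inner policy
        total += expected_return(p_me, p_opp)  # meta-agent's reward this meta-step
        # Naive learner: finite-difference gradient ascent on its own return.
        grad = (expected_return(sigmoid(opp_logit + eps), p_me)
                - expected_return(sigmoid(opp_logit - eps), p_me)) / (2 * eps)
        opp_logit += opp_lr * grad
    return total / n_meta_steps

# Gradient-free outer loop: a basic evolution strategy over the meta-parameters.
rng = np.random.default_rng(0)
theta, sigma, lr, pop = np.zeros(2), 0.1, 0.02, 32
for _ in range(200):
    noise = rng.normal(size=(pop, 2))
    returns = np.array([meta_episode(theta + sigma * n) for n in noise])
    theta += lr / (pop * sigma) * (returns - returns.mean()) @ noise

print("meta-params:", theta, "average return:", meta_episode(theta))
```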
By not requiring higher-order derivatives, M-FOS naturally avoids the infinite regress problem (“you think that I think that you think…”) of previous OS methods. However, independent learning in general-sum games is inherently unstable. To overcome this instability, M-FOS self-play uses a method first proposed in Cognitive Hierarchies (Camerer et al. 2004). Here, the co-player alternates between playing as an NL agent and, with increasing frequency, as another M-FOS agent. By the end of training, the NL component has been fully removed.
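A minimal sketch of how such an annealing schedule could look (illustrative only; the exact schedule used in the paper may differ):

```python
import random

def sample_co_player(progress, nl_agent, mfos_agent):
    """With probability (1 - progress) the co-player is a naive learner, otherwise
    another M-FOS agent; `progress` anneals from 0 to 1 over training, so the
    NL component is fully removed by the end."""
    return mfos_agent if random.random() < progress else nl_agent
```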
We now discuss how M-FOS performs across several general-sum game benchmark tasks, starting with the IPD. We compare M-FOS performance against other methods in a round-robin manner (see the table below). Here, Multiagent MAML (M-MAML) is a novel MAML-inspired (Finn et al. 2017) variant of gradient-based Meta-MAPG (Kim et al. 2021), and the NL agent is implemented by PPO (Schulman et al. 2017).
Clearly, M-FOS wins against other methods by a huge margin. Note that M-FOS scores even better than mutual cooperation (-1) against all methods (other than itself). But how can the NL agent possibly score less than -2, the worst it could do by blindly defecting? It turns out that the NL agent’s gradients do not update fast enough to play the optimal response within a single step, resulting in further exploitation by its M-FOS opponent.
This motivates evaluating a variant of NL that is allowed to “cheat”, i.e., it can observe the opponent’s next policy and then gets a sufficient budget of gradient steps to find the (near-)optimal response. Surprisingly, even against this agent M-FOS achieves an average score of -0.71. Even more interestingly, a closer look reveals that M-FOS does not actually induce a fully cooperative strategy, but rather one that responds to defection with defection and, unlike TFT, answers cooperation with cooperation only most of the time, not always.
This strategy learnt by M-FOS, Extortion, which may seem all too obvious as it pervades human behaviour, was only formally characterised in 2012 (Press and Dyson 2012). (Note that we use the capitalised word “Extortion” whenever referring to the specific way in which M-FOS extorts other learning algorithms, and “extortion” in all other settings, including when referring to human behaviour.) If agent A is extorted by agent B, then always cooperating is in A’s best interest. Extortion is characterised by a linear relationship between A’s and B’s returns, with A’s returns being strictly lower than B’s (as can be seen in the figure below). To our knowledge, M-FOS is the first learning algorithm to learn Extortion.
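Concretely, Press and Dyson (2012) show that an extortionate strategy played by B enforces a linear relation between the players’ long-run average payoffs, which (in our notation) can be written as

$$
s_B - P \;=\; \chi\,(s_A - P), \qquad \chi > 1,
$$

where s_A and s_B are the average payoffs of A and B, P is the mutual-defection payoff, and the extortion factor χ determines how much more of the surplus B captures than A.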
Having discussed M-FOS’ impact on AI-AI cooperation settings, a natural question is how it relates to human behaviour. A recent paper by Milinski (2022) establishes that a significant minority of human players choose generous strategies in settings in which all players have equal incentives, but switch to extortion when provided with additional incentives, e.g., job incentives. In contrast, M-FOS chooses to exploit others when faced with weak co-players, even without additional incentives. By extension, this may indicate that future highly intelligent agents might choose to use Extortion by default, calling for additional precautions (we further discuss possible precautions in the penultimate section).
Perhaps disturbingly, extortion can be understood as a fundamentally prosocial strategy (Milinski 2022). After all, it does induce cooperation. However, this form of cooperation is inequitable, as one agent’s payoff dominates the other’s in an otherwise symmetric game. This inherent unfairness does, empirically, induce some human players to engage in disciplining: they self-sabotage such that both players are worse off. At least in repeated games, this self-sabotaging strategy might incentivise the co-player to abandon, or at least moderate, its exploitative strategy. However, we do not observe this behaviour in any of the artificial agents we train M-FOS with (with the possible exception of M-FOS self-play).
As M-FOS naturally engages in the exploitation of co-players (with M-FOS as co-player being a noteworthy exception), a natural question is whether Extortion can be avoided through the imposition of additional incentives. Indeed, we empirically show that if its objective is to maximise the average reward of all players, instead of only its own reward, M-FOS will learn to shape its co-player in a maximally prosocial manner (shown in the figure below). Intuitively, this rather sweeping power of cooperative M-FOS might not seem surprising: by design, M-FOS operates on a higher level of reasoning than its co-players and can draw on considerably more compute. Proving this theoretically, however, is (at the time of writing) a subject for future work.
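In our notation, this amounts to swapping the meta-objective optimised by the M-FOS outer loop from the agent’s own return to the players’ average return:

$$
J_{\text{selfish}} \;=\; \mathbb{E}\big[R^{i}\big] \quad\longrightarrow\quad J_{\text{prosocial}} \;=\; \mathbb{E}\Big[\tfrac{1}{N}\sum\nolimits_{j=1}^{N} R^{j}\Big],
$$

where R^j is player j’s return over a meta-episode and N is the number of players.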
Nevertheless, our empirical observations may have significant ramifications when extrapolated to real-world settings. When diverse learning agents face a strong OS agent, equitable cooperation has to be ensured exogenously. This helps make the need for such additional cooperative incentives explicit to real-world policymakers and regulators.
Understanding the behaviour of very intelligent future agents in games is undoubtedly useful for casting light on the opportunities and limitations of AI-AI interactions. It seems likely that agents will reason about other agents' learning in a non-myopic way, taking into account future learning steps. M-FOS is a straightforward implementation of these ideas. Hence, studying M-FOS may yield valuable insights into the behaviour of future agents.
Given its ability to exploit a variety of commonly used simple general-sum learning algorithms, such as NL, and first-generation opponent-shaping techniques such as LOLA, M-FOS reveals these algorithms to be unsafe for real-world deployment. However, by exposing the vulnerability of these algorithms, Cooperative AI researchers get a clearer picture of potential future problems, allowing us to address them before it’s too late. Such a proactive approach to vulnerability reporting has paid off in other fields, such as cybersecurity. It is certainly a better long-term strategy than hoping that algorithms’ weaknesses will remain undetected and unexploited.
For example, as detailed in the next section, it is conceivable that meta-variants of M-FOS agents exist that operate on an even higher cognitive level than M-FOS. Such algorithms might plausibly be able to exploit M-FOS itself. While building such variants currently poses substantial engineering challenges, once practical implementations emerge, one would wish for them to be disclosed (in an analogous fashion to how cybersecurity companies and security experts disclose information on the latest cyber exploits).
The empirical observation that M-FOS agents, in the absence of other constraints, will learn to extort co-players in order to achieve a goal whenever possible adds concrete evidence to what has so far been just a theoretical expectation. In the future, this tendency to extort will have to be addressed somehow, whether by training on a prosocial objective or otherwise.
Milinski (2022) argues that the adoption of extortion in evolving populations is naturally limited, as extortioners choose defection in self-play, thus inducing anti-social behaviour. This so-called limit frequency of extortion is estimated at circa 40% within populations of human players. M-FOS’ limit frequency of extortion is clearly 0%, as it achieves perfect cooperation in self-play. This implies that a world full of online M-FOS agents would also be one that is maximally prosocial. However, our experiments show that preventing M-FOS from extorting weaker co-learners requires the adoption of a team reward. If this social value-based decision is made, M-FOS may emerge as a powerful tool in the Cooperative AI toolbox.
In the longer term, it seems plausible that, as M-FOS (or a meta-variant thereof) becomes more popular, practitioners will be incentivised either to use M-FOS itself or to explicitly train their learning algorithms against M-FOS adversaries in order to avoid being extorted. If, however, defence from extortion consists in selecting learning algorithms with ever higher cognitive abilities, then the correspondingly increasing computational costs associated with such methods could amplify existing societal inequalities in access to AI technology.
An alternative could be to instead rely on detecting extortion, and then either engaging in disciplining behaviour or refusing to interact further. As in real-world policing, AI agents could be given the option to report exploitative behaviour to a trusted enforcement agency, governmental or otherwise, which could follow up with appropriate measures.
We conclude this section with a brief summary of key takeaways for Cooperative AI research.
OS is a rich field of research, and the emergence of meta-opponent shaping naturally opens up a flurry of potential follow-up research. We here mention just a select few opportunities for further scientific investigation.
First, the M-FOS meta-agent outputs the weights of an agent policy network. In the IPD, this policy network can be represented with very few weights. However, high-dimensional general-sum games would require scaling up this representation.
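For instance, a standard memory-one IPD policy needs only five parameters: the probability of cooperating on the first move, and after each of the four possible joint outcomes of the previous step. A hypothetical sketch of this common parameterisation (the paper’s exact representation may differ):

```python
import numpy as np

def memory_one_policy(params, last_joint_action=None):
    """Five logits: cooperation on the first move, and after each previous joint
    outcome (CC, CD, DC, DD). `last_joint_action` is None on the first move,
    else an integer 0-3 indexing the previous joint outcome."""
    idx = 0 if last_joint_action is None else 1 + last_joint_action
    return 1.0 / (1.0 + np.exp(-params[idx]))  # probability of cooperating
```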
Second, M-FOS would be further strengthened by establishing theoretical guarantees. In particular, it would be exciting to establish whether M-FOS can indeed guarantee the preservation of stable fixed points in arbitrary differentiable games – a feat that, so far, SOS is the only OS technique to achieve.
Third, one may wish to investigate how stable M-FOS self-play is in practice. For example, are there situations where two M-FOS agents learn to converge toward bad equilibria? If so, could we somehow overcome such brittleness? Self-play might not seem like a particularly interesting setting in the real world, given the general inability to enforce a co-player’s choice of learning algorithm. However, if stable, M-FOS could be a candidate for standardisation efforts in settings in which the choice of learning algorithm can be enforced or incentivised (cf. various emergent industry standards on autonomous driving).
Fourth, as a meta-learning technique, M-FOS requires many training episodes. So how can we achieve model-free OS when training episodes may have dangerous, or even deadly, consequences, as in real-world autonomous driving, for example? Gaining insights into this seems central to the safety of OS.
Fifth, it would be exciting to see whether M-FOS can successfully shape a more extensive set of diverse co-players, including humans. For example, given that human players can save themselves from M-FOS Extortion through disciplining, under what conditions do artificial learning algorithms learn to discipline? Is the ability to discipline a crucial ingredient to cooperative intelligence? Understanding this aspect may be helpful for constructing a general research program towards making general-sum learning algorithms robust against exploitation.
Sixth, as hinted at in the previous section, similarly to how HOLA and COLA implement higher-order learning versions of LOLA, it is conceivable that Meta-M-FOS (or even higher-order variants) can be formulated. Such higher-order variants would presumably be able to take training against M-FOS co-players into account, and thus learn to exploit M-FOS. How can we make existing real-world learning agents robust against exploitation by superior technology (or superior access to computational resources) in such a case?
Last but not least, we believe that M-FOS is a useful model of future artificial agents. M-FOS may serve as a vital research tool for preventing future agents from extorting humans and other AI agents. We suggest the Cooperative AI community seize this unique research opportunity. If it does, we are confident that releasing M-FOS into the public domain will make a decidedly net-positive contribution to society, despite its apparent potential for undesirable dual use.
R. Axelrod and W. Hamilton, "The Evolution of Cooperation," Science (211:4489), Pages 1390-1396. 1981.
C. F. Camerer, T.-H. Ho and J.-K. Chong, "A Cognitive Hierarchy Model of Games," The Quarterly Journal of Economics (119:3), Pages 861-898. 2004.
A. Dafoe, Y. Bachrach, G. Hadfield, E. Horvitz, K. Larson and T. Graepel, "Cooperative AI: Machines Must Learn to Find Common Ground," Nature (593:7857), Pages 33-36. 2021.
C. Finn, P. Abbeel and S. Levine, "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks," in Proceedings of the 34th International Conference on Machine Learning, Pages 1126–1135. 2017.
J. Foerster, R. Y. Chen, M. Al-Shedivat, S. Whiteson, P. Abbeel and I. Mordatch, "Learning with Opponent-learning Awareness," in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, Pages 122-130. 2018.
L. Lamport, "The part-time parliament," ACM Transactions on Computer Systems (16:2), Pages 133-169. 1998.
A. Letcher, J. N. Foerster, D. Balduzzi, T. Rocktäschel and S. Whiteson, "Stable Opponent Shaping in Differentiable Games," in 7th International Conference on Learning Representations. 2019.
C. Lu, T. Willi, C. A. S. de Witt and J. N. Foerster, "Model-Free Opponent Shaping," in International Conference on Machine Learning, Pages 14398-14411. 2022.
D. Kim, M. Liu, M. Riemer, C. Sun, M. Abdulhai, G. Habibi, S. Lopez-Cot, G. Tesauro and J. P. How, "A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning," in Proceedings of the 38th International Conference on Machine Learning, Pages 5541-5550. 2021.
M. Milinski, "Extortion — A voracious prosocial strategy," Current Opinion in Psychology (44), Pages 196-201. 2022.
W. H. Press and F. J. Dyson, "Iterated Prisoner's Dilemma contains strategies that dominate any evolutionary opponent," Proceedings of the National Academy of Sciences (109:26), Pages 10409-10413. 2012.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, "Proximal Policy Optimization Algorithms," arXiv:1707.06347. 2017.
T. Willi, A. Letcher, J. Treutlein and J. N. Foerster, "COLA: Consistent Learning with Opponent-Learning Awareness," in International Conference on Machine Learning, Pages 23804-23831. 2022.
C. Zhang and V. Lesser, "Multi-Agent Learning with Policy Prediction," in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Pages 927–934. 2010.
November 21, 2022