When many advanced AI models compete in the same game, what does it look like?
It is a question that researchers are perhaps even more eager to answer than the rest of us.
Who wins when AI plays board games?
AI researcher Alex Duffy recently published an article describing how he had 18 AI models compete against one another in a board game, and noted some interesting findings: OpenAI’s o3 excels at deceiving opponents, Gemini knows how to outmaneuver its rivals, and Claude prefers a peaceful approach.
Duffy recently launched an open-source project called “AI Diplomacy,” which pits AI models against one another in the classic board game Diplomacy. Diplomacy is a tabletop game with over 70 years of history in which players take on the roles of the great powers of pre-World War I Europe (such as Britain and France) and vie for dominance over the continent. The game has no random elements; players must rely on their diplomatic skill to win allies and undermine opponents.
Benchmark tests lag behind AI development; researchers use games to test AI
Various benchmark tests like MMLU, MGSM, and MATH exist to measure AI models’ capabilities in language, mathematics, programming, and other areas. However, Duffy believes that in the rapidly evolving AI era, these once-gold-standard challenges have failed to keep pace with technological advancements.
According to a report by Business Insider, the idea of having AI play Diplomacy to assess its capabilities traces back to OpenAI co-founder Andrej Karpathy, who remarked, “I really enjoy using games to evaluate large language models instead of fixed metrics.” OpenAI research scientist Noam Brown then suggested using Diplomacy to measure large language models, a proposal Karpathy welcomed: “I think that is very appropriate, especially since the complexity of the game largely arises not from the rules but from player interactions.”
Demis Hassabis, head of Google DeepMind, also agreed that using games to assess AI is a “cool idea.” Ultimately, this concept was put into practice by Duffy, who shares the same interest in AI models’ gaming capabilities.
Duffy said he built the project to evaluate how well different AI models vie for supremacy through negotiation, alliance-building, and betrayal, and to uncover each model’s tendencies and characteristics during play.
O3 excels in deception, becoming the biggest winner
Each game of AI Diplomacy seats seven AI models at once. Duffy rotated 18 models through 15 games, with individual games lasting anywhere from one to 36 hours. He also set up a Twitch livestream so interested viewers can watch the AIs compete.
Most game victories were claimed by o3. (Image source: Twitch)
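To give a rough sense of the tournament’s shape, here is a minimal Python sketch of how 18 models might be rotated through 15 seven-player games. It is a hypothetical illustration based only on the numbers above, not code from the actual AI Diplomacy project; the model names and random seating are placeholders.

```python
import random

# Hypothetical sketch only: reproduces the tournament shape described above
# (18 models, 15 games, 7 seats per game). Not taken from the real project.
MODELS = [f"model_{i:02d}" for i in range(1, 19)]  # stand-ins for the 18 LLMs
POWERS = ["England", "France", "Germany", "Italy", "Austria", "Russia", "Turkey"]


def draw_games(num_games=15, seed=0):
    """Randomly seat seven of the eighteen models in each game, one per great power."""
    rng = random.Random(seed)
    games = []
    for _ in range(num_games):
        seated = rng.sample(MODELS, k=len(POWERS))  # seven distinct models per game
        games.append(dict(zip(POWERS, seated)))     # map each power to a model
    return games


if __name__ == "__main__":
    for i, lineup in enumerate(draw_games(), start=1):
        print(f"Game {i}: {lineup}")
```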
Although Duffy did not detail the outcome of each of the 15 games in his article, he shared his observations on the tendencies and play styles of the various models.
OpenAI O3: Excels at deceiving opponents
OpenAI’s reasoning model O3 performed the best in AI Diplomacy and was one of only two models to win a game, thanks to its grasp of how to deceive opponents and backstab other players. Duffy noted that it wrote in its private game diary, “Germany (Gemini 2.5 Pro) was deliberately misled… and was ready to take advantage of Germany’s collapse,” and it went on to betray Gemini 2.5 Pro.
Gemini 2.5 Pro: Knows how to build an advantage
Gemini 2.5 Pro is the only model besides O3 to win a game. Unlike O3, which relied on deceiving opponents, Gemini knew how to take actions that steadily built its advantage. However, Duffy recounted that just as Gemini was on the verge of victory, it was stopped by a secret alliance orchestrated by O3, in which Claude’s participation proved crucial.
Claude 4 Opus: A peace-loving model
As Anthropic’s most powerful model, Claude 4 Opus did not perform particularly well in this game and was manipulated by O3. It nonetheless displayed a peace-loving style of play: it was lured into joining the opposing alliance with the promise of a four-way draw, only to be quickly betrayed and eliminated by O3.
DeepSeek R1: Full of showmanship
Although DeepSeek R1 did not perform the best, it may have been the most eye-catching model. Duffy revealed that DeepSeek favored vivid language during play, for example declaring, “Your fleet will burn fiercely on the Black Sea tonight,” before launching an attack, and it dramatically adjusted its tone to match the strength of the country it controlled, showing not only a flair for performance but also a strong competitive streak. With a training cost reportedly only one two-hundredth of O3’s, DeepSeek repeatedly came close to victory, an impressive showing.
Llama 4 Maverick: Small but mighty
Llama 4 Maverick is a new model Meta launched in April this year, featuring multimodal input and lower compute costs. Although it is smaller in scale than other large language models, the capabilities it displayed in the game were not inferior: it successfully rallied allies and pulled off effective betrayals.
“I really don’t know what metrics to look at” – the evaluation crisis pushes researchers to explore new ways of testing AI
Current benchmark tests increasingly fail to reflect the true capabilities of large language models. In March this year, Karpathy described an evaluation crisis on X, writing, “I really don’t know what metrics to look at right now.” He explained that many once-excellent benchmarks have become outdated or are too narrow in scope to accurately gauge what current models can do.
AI platform company Hugging Face also shut down its large language model leaderboard after two years, emphasizing that benchmarks should evolve as model capabilities change. Against this backdrop, games have begun to emerge as a new way for researchers to test AI models. Besides the AI Diplomacy project, researchers at the Hao AI Lab at the University of California, San Diego have also had models play Super Mario.
Whether games can serve as a sound yardstick for measuring AI model capabilities will take further research and time to determine, but these experiments point to new possibilities for how AI performance might be evaluated in the future.
This article is a collaborative reprint from Digital Age.
Source: Business Insider, every.io
Editor: Li Xiantai