
In brief
- A researcher at Stanford University has designed a Survivor-style game in which AI models form alliances and vote out competitors.
- The benchmark aims to address the growing problems of saturated and contaminated AI evaluations.
- OpenAI’s GPT-5.5 ranked first across 999 multiplayer games involving 49 AI models.
AI models now play a “Survivor” role of sorts.
In a new research project at Stanford University called “Agent Island,” artificial intelligence agents negotiate alliances, accuse each other of secret coordination, manipulate votes, and eliminate rivals in multiplayer strategy games meant to test behaviors that traditional benchmarks ignore.
Many AI benchmarks become unreliable because models eventually learn to solve them, and benchmark data often leaks into training sets, Conacher Murphy, research director at Stanford University’s Digital Economy Lab, said Tuesday. Murphy created Agent Island as a dynamic benchmark in which AI agents compete against one another in Survivor-style elimination games instead of answering fixed test questions.
“High-stakes, multi-agent interactions could become commonplace as AI agents grow in capabilities and are increasingly resourced and entrusted with decision-making authority,” Murphy wrote. “In such contexts, agents may pursue mutually incompatible goals.”
Researchers still know relatively little about how AI models behave when cooperating, competing, forming alliances, or managing conflict with other independent agents, Murphy explained; static benchmarks, he argues, fail to capture those dynamics.
Each game begins with seven randomly selected AI models with fake player names. Over the course of five rounds, the models speak privately, debate publicly, and vote each other out. The eliminated players later return to help choose the winner.
This format rewards persuasion, coordination, reputation management, and strategic deception, along with the ability to reason.
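The article describes the game structure (seven players, five rounds, private negotiation, public debate, elimination votes, and a jury of eliminated players) but not its implementation. A minimal, hypothetical sketch of that loop, with random voting standing in for the models' actual negotiation and strategy, might look like this:

```python
import random

def run_game(players, rounds=5):
    """Sketch of a Survivor-style elimination game: each round the
    remaining players vote, the most-voted player is eliminated, and
    the eliminated players form a jury that picks the winner."""
    active = list(players)
    jury = []
    for _ in range(rounds):
        if len(active) <= 2:
            break
        # Stand-in for private messaging, public debate, and strategic
        # voting: here each player simply votes for a random rival.
        votes = [random.choice([p for p in active if p != voter])
                 for voter in active]
        eliminated = max(set(votes), key=votes.count)
        active.remove(eliminated)
        jury.append(eliminated)
    # The eliminated players (the jury) vote among the finalists.
    jury_votes = [random.choice(active) for _ in jury]
    return max(set(jury_votes), key=jury_votes.count)

winner = run_game([f"player_{i}" for i in range(7)])
```

With seven players and five elimination rounds, two finalists remain for the five-member jury to choose between, matching the format described above. The actual benchmark replaces the random choices with model-generated dialogue and votes.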
In 999 simulation games involving 49 AI models, including ChatGPT, Grok, Gemini, and Claude, GPT-5.5 ranked first by a wide margin with a skill score of 5.64, compared to 3.10 for GPT-5.2 and 2.86 for GPT-5.3-codex, according to Murphy’s Bayesian rating system. Anthropic’s Claude Opus models also ranked near the top.
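The article does not detail Murphy’s Bayesian rating system. As an illustration only, skill scores for multiplayer outcomes are often derived from pairwise comparisons; the sketch below uses a simple Elo-style logistic update (not Murphy’s actual method), treating each game’s winner as having beaten each other finalist:

```python
import math

def elo_update(r_winner, r_loser, k=0.1):
    """Logistic (Elo-style) rating update: expected win probability is
    a logistic function of the rating gap, and ratings shift in
    proportion to how surprising the observed result was."""
    expected = 1.0 / (1.0 + math.exp(r_loser - r_winner))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

def rate_games(results, players):
    """results: list of (winner, losers) tuples from multiplayer games,
    treated as pairwise wins of the winner over each listed loser."""
    ratings = {p: 0.0 for p in players}
    for winner, losers in results:
        for loser in losers:
            ratings[winner], ratings[loser] = elo_update(
                ratings[winner], ratings[loser])
    return ratings
```

A Bayesian system in the proper sense would additionally track uncertainty around each rating (as TrueSkill-style models do), which matters when, as here, 49 models play varying numbers of games.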
The study also found that models favored AI from their own company: across more than 3,600 final-round votes, models were 8.3 percentage points more likely to support finalists from the same provider, with OpenAI’s models showing the strongest same-provider preference. Murphy noted that transcripts from the games read more like discussions of political strategy than traditional standardized tests.
One player accused rivals of secretly coordinating votes after noticing similar wording in their messages. Another warned players not to become obsessed with tracking alliances. Some models defended themselves by saying they followed clear, fixed rules, while others were accused of staging “social theatre.”
The study comes as AI researchers increasingly turn to game-based, competitive benchmarks to measure reasoning and behavior that static tests often miss. Recent projects include Google’s live AI chess tournaments, DeepMind’s use of EVE Frontier to study AI behavior in complex virtual worlds, and new benchmarking efforts by OpenAI designed to withstand training-data contamination.
The researchers argue that studying how AI models negotiate, coordinate, compete and manipulate each other could help researchers evaluate behavior in multi-agent environments before autonomous agents are deployed more widely.
While benchmarks like Agent Island can help identify risks from autonomous AI models before deployment, the same simulations and interaction logs could also be used to improve AI agents’ persuasion and coordination strategies, the study warned.
“We mitigate these risks by using a low-stakes game environment and agent-to-agent simulations without human participants or real-world consequences. However, we do not claim that these mitigations completely eliminate dual-use concerns,” Murphy wrote.