
Study accuses LM Arena of helping top AI labs game its benchmark


A new paper from researchers at AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.

According to the authors, LM Arena allowed some industry-leading AI companies, including Meta, OpenAI, Google, and Amazon, to privately test several variants of their AI models and then withhold the scores of the lowest performers. This made it easier for those companies to claim a top spot on the platform's leaderboard, though the opportunity was not extended to every firm, the authors say.

"Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others," said Cohere's VP of AI research and co-author of the study, Sara Hooker, in an interview with TechCrunch. "This is gamification."

Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by putting answers from two different AI models side-by-side in a "battle," and asking users to choose the best one. It's not uncommon to see unreleased models competing in the arena under a pseudonym.

Votes over time contribute to a model's score — and, consequently, its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one.

However, that's not what the paper's authors say they uncovered.

One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March leading up to the tech giant's Llama 4 release, the authors allege. At launch, Meta only publicly revealed the score of a single model — a model that happened to rank near the top of the Chatbot Arena leaderboard.

A chart pulled from the study. (Credit: Singh et al.)

In an email to TechCrunch, LM Arena Co-Founder and UC Berkeley Professor Ion Stoica said that the study was full of "inaccuracies" and "questionable analysis."

"We are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference," said LM Arena in a statement provided to TechCrunch. "If a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly."