Latest

Preference Proxy Evaluations

Most LLMs are optimized using an LLM judge or reward model to approximate human preference. These training processes can cost hundreds of thousands or millions of dollars. How can we know whether to trust an LLM judge or reward model, given its critical role in guiding LLM training?

Agent Arena

With the growing interest in Large Language Model (LLM) agents, there is a need for a unified and systematic way to evaluate agents.

Statistical Extensions of the Bradley-Terry and Elo Models

Chatbot Arena uses the Bradley-Terry model for statistical inference on model strength. Recently, we have developed some extensions of the Bradley-Terry model, and of the closely related Elo model, for binary-comparison inference problems.

RedTeam Arena

We are excited to launch RedTeam Arena, a community-driven redteaming platform, built in collaboration with Pliny and the BASI community!

Does Style Matter?

We controlled for the effects of response length and markdown formatting, and the ranking did change. This is just a first step toward our larger goal of disentangling substance and style in the Chatbot Arena leaderboard.

Chatbot Arena Conversation Dataset Release

Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. In this blog post, we are releasing an

The Multimodal Arena is Here!

You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against each other.

Contributors: Christopher Chou*, Lisa Dunlap*, Wei-Lin Chiang, Ying Sheng, Lianmin Zheng, Anastasios Angelopoulos, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez