LMArena Blog

Research

Chatbot Arena Categories

By grouping tasks into categories, we can assess models’ strengths and weaknesses in a more granular way.
LMArena Team 30 Oct 2024
Preference Proxy Evaluations

Most LLMs are optimized using an LLM judge or reward model to approximate human preference. These training processes can cost hundreds of thousands or millions of dollars. How can we know whether to trust an LLM judge or reward model, given its critical role in guiding LLM training?
LMArena Team 20 Oct 2024
Agent Arena

With the growing interest in Large Language Model (LLM) agents, there is a need for a unified and systematic way to evaluate them.
LMArena Team 03 Oct 2024
Does Style Matter?

We controlled for the effect of length and markdown, and indeed, the ranking changed. This is just a first step towards our larger goal of disentangling substance and style in the Chatbot Arena leaderboard.
LMArena Team 29 Aug 2024
Chatbot Arena Conversation Dataset Release

Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. In this blog post, we are releasing an…
LMArena Team 20 Jul 2024