Chatbot Arena Policy

Live and Community-Driven LLM Evaluation
Last Updated: May 27, 2025

Our Mission

Chatbot Arena (lmarena.ai) is an open-source project created by members from LMSYS and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases and publish leaderboard periodically.

Our Progress

Chatbot Arena was first launched in May 2023 and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 800,000 votes. This extensive engagement has enabled the evaluation of more than 90 LLMs, including both commercial GPT-4, Gemini/Bard and open-weight Llama and Mistral models, significantly enhancing our understanding of their capabilities and limitations.

Our periodic leaderboard and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs. Our commitment to open science is further demonstrated through the sharing of user preference data and one million user prompts, supporting research and model improvement.

We also collaborate with open-source and commercial model providers to bring their latest models to community for preview testing. We believe this initiative helps advancing the field and encourages user engagement to collect crucial votes for evaluating all the models in the Arena. Moreover, it provides an opportunity for the community to test and provide anonymized feedback before the models are officially released.

The platform’s infrastructure (FastChat) and evaluation tools, available on GitHub, emphasize our dedication to transparency and community engagement in the evaluation process. This approach not only enhances the reliability of our findings but also fosters a collaborative environment for advancing LLMs.

In our ongoing efforts, we feel obligated to establish policies that guarantee evaluation transparency and trustworthiness. Moreover, we actively involve the community in shaping any modifications to the evaluation process, reinforcing our commitment to openness and collaborative progress.

Our Policy

Transparent: The model evaluation and ranking pipelines have been open-sourced in the FastChat repository. We release data collected from the platform as well. Together, this means anyone can audit our leaderboard using publicly released data. The methodology and technical details behind LMArena have been published in a sequence of academic papers (1, 2, 3). Furthermore, many of the changes and improvements to our evaluation process are driven by community feedback.

Listing models on the leaderboard: The public leaderboard will only include models that are generally available to the public. Specifically, models must meet at least one of the following criteria to qualify as publicly released models:

Open Weights: the model’s weights are publicly accessible.
Public APIs: The model is accessible via an API (e.g., OpenAI’s GPT-4o, Anthropic’s Claude) with transparent pricing and documentation.
Public Services: The model is available through a widely accessible public-facing service (e.g., Gemini App, ChatGPT).

Once a publicly released model is listed on the leaderboard, the model will remain accessible at lmarena.ai for at least two weeks for the community to evaluate it.

The leaderboard distinguishes between first-party endpoints and third-party endpoints:

First-party endpoints: Endpoints hosted by the model’s original creator (e.g., GPT-4 by OpenAI).
Third-party endpoints: Endpoints provided by a different entity using models developed by another creator (e.g., a company, other than Meta, offering an endpoint based on Llama).

We prioritize listing first-party endpoints by default but may include third-party endpoints under the following conditions:

The provider must disclose all first-party open-source and proprietary models used in the endpoint (e.g., Llama, GPT-4o).
The endpoint must be version-controlled and remain static throughout the period of leaderboard listing.

Third-party endpoints will be explicitly labeled as “third-party” on the leaderboard.

Evaluating publicly released models. Evaluating such a model consists of the following steps:

Add the model to Arena for blind testing and let the community know it was added.
Accumulate enough votes until the model’s rating stabilizes.
Once the model’s rating stabilizes, we list the model on the public leaderboard.

Our testing process also generally follows a timeline as follows:

Public models will be tested until their leaderboard ranking converges: after 3,000 votes, or earlier if the confidence interval is small enough to distinguish it from surrounding models.
If more than 10 models were pre-release tested in parallel, we will mark model scores as “provisional” until 2,000 fresh votes have been collected after the model’s public release.
We will release model results to the community on the public leaderboard immediately once the ranking converges. There is one exception: the model provider can reach out before its listing and ask for an one-day heads up. In this case, we will share the rating with the model provider and wait for an additional day before listing the model on the public leaderboard.
The sampling weight of a model is set to 5 until 3,000 votes are collected. Then after release we assign the sampling weight to 1. Models may be retired after 3,000 votes if there are two more recent models in the same series (e.g. gpt-4o-0513 and gpt-4o-0806) and/or if there are more than 3 providers that offer models cheaper or same price and strictly better (according to overall Arena Score) than this model.
The top-10 models according to the overall Arena Score will be given a sampling weight of 3. This is to ensure the best community experience when visiting our site.
The best model from the top-10 providers according to the overall Arena Score will be given a sampling weight of 1 to ensure diversity of battles.

This policy may be modified moving forward; visit this website for the most recent version.

Evaluating unreleased models: We collaborate with open-source and commercial model providers to bring their unreleased models to community for preview testing.

Model providers can test their unreleased models anonymously, meaning the models’ names will be anonymized. A model is considered unreleased if its weights are neither open, nor available via a public API or service. Evaluating an unreleased model consists of the following steps:

Add the model to Arena with an anonymous label. i.e., its identity will not be shown to users.
Keep it until we accumulate enough votes for its rating to stabilize or until the model provider withdraws it.
Once we accumulate enough votes, we will share the result privately with the model provider. These include the rating, as well as release samples of up to 20% of the votes. (See Sharing data with the model providers for further details).
Remove the model from Arena.

If while we test an unreleased model, that model is publicly released, we immediately switch to the publicly released model evaluation process. Model providers are all allowed to test multiple variants of their models pre-release, subject to our system’s constraints.

To ensure the leaderboard accurately reflects model rankings, we rely on live comparisons between models. Hence, we may deprecate models from the leaderboard one month after they are no longer available online or publicly accessible. As of June 15, 2025, models that have been retired from battle mode will be recorded in a public list for transparency and future reference.

Sharing data with the community: We will periodically share data with the community. Specifically, as of July 1, 2025, we will share 100% of the arena vote data used for the public leaderboard (model identities and votes), so the leaderboard can be reproduced. We may also share a portion of the prompts and responses, so long as users have consented to the inclusion of this data in a public dataset. For models that have not appeared on the public leaderboard, we may still release data, but the model will be labeled as “anonymous”.

Sharing data with the model providers: Upon request, we will offer early data access with model providers who wish to improve their models. In particular, with a model provider, we will share the data that includes their model’s answers. For battles, we may not reveal the opponent model, labeling it instead as “anonymous”. If the model is not on the leaderboard at the time of sharing, the model’s answers will also be labeled as “anonymous”. Before sharing the data, we will remove user PII (e.g., Azure PII detection for texts).

FAQ

Why another eval?

Most LLM benchmarks are static, which makes them prone to contamination, as these LLMs are trained on most available data on the Internet. Chatbot Arena aims to alleviate this problem by providing live evaluation with a continuous stream of new prompts from real people. We also believe that the open nature of the platform will attract users that accurately reflect the broader set of LLM users and real use cases.

What model to evaluate? Why not all?

We will continuously add new models and retire old ones. It is not feasible to add every possible model due to the cost and the scalability of our evaluation process, i.e., it might take too much to accumulate enough votes to accurately rate each model. Today, the decision to add new models is rather ad-hoc: we add models based on the community’s perceived interest. We intend to formalize his process in the near future.

Why should the community trust our eval?

We seek to provide transparency and all tools as well as the platform we are using in open-source. We invite the community to use our platform and tools to statistically reproduce our results.

How do you handle security?

Security is always a concern, particularly against DDOS and Sybil attacks. Our safeguards currently include:

Malicious user detection: As outlined in Section 5.1 of the original Chatbot Arena paper, we identify and handle malicious behavior effectively.
Cloudflare bot protection: Applied to all web traffic to defend against automated attacks.
Google reCAPTCHA v3: Enabled for each chat and vote to minimize spam and bot activity.
Category pipeline: Filters out low-quality data and organizes incoming data into relevant categories.
Vote limitations: Limits the number of votes per IP address per day to prevent abuse.
Prompt deduplication: Deduplicates overly frequent prompts (e.g., top 0.1%) to ensure a diverse dataset.
Delayed leaderboard release: Allows us to manually address any issues that may arise.

We are also planning to introduce further defenses, such as statistical methods and optional user login systems, to enhance security. If you have ideas on how to further improve our security measures, please reach out to us!

Any feedback?

Feel free to send us email or leave feedback on Github!