LMArena Blog

Research

Chatbot Arena Categories

By grouping tasks into categories, we can assess models’ strengths and weaknesses in a more granular way.
LMArena Team 30 Oct 2024
Preference Proxy Evaluations

Most LLMs are optimized using an LLM judge or reward model to approximate human preference. These training processes can cost hundreds of thousands or millions of dollars. How can we know whether to trust an LLM judge or reward model, given its critical role in guiding LLM training?
LMArena Team 20 Oct 2024
Agent Arena

With the growing interest in Large Language Model (LLM) agents, there is a need for a unified and systematic way to evaluate them.
LMArena Team 03 Oct 2024
Does Style Matter?

We controlled for the effect of length and markdown, and indeed, the ranking changed. This is just a first step towards our larger goal of disentangling substance and style in the Chatbot Arena leaderboard.
LMArena Team 29 Aug 2024
Chatbot Arena Conversation Dataset Release

Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. In this blog post, we are releasing an…
LMArena Team 20 Jul 2024