Does Sentiment Matter Too?

Introducing Sentiment Control: Disentangling Sentiment and Substance

Contributors:

Connor Chen
Wei-Lin Chiang
Tianle Li
Anastasios Angelopoulos

Introduction

You may have noticed that recent models on Chatbot Arena appear more emotionally expressive than their predecessors. But does this added sentiment actually improve their rankings on the leaderboard? Our previous exploration revealed that style — including formatting and length — plays a significant role in perceived model quality. Yet, we hypothesized that style may go beyond layout—perhaps sentiment and emojis are just as influential.

Enter Sentiment Control: a refined version of our original Style Control methodology that expands the feature set to include:

Emoji Count
Sentiment (Very Negative, Negative, Neutral, Positive, Very Positive)

Let’s see how this expanded definition of style affects model rankings and whether it shifts the standing of specific models.

Methodology

Building upon our previous style control approach, we’ve now included additional style variables:

Emoji Count: Total number of emojis used in responses.
Sentiment Scores: The chatbot's responses in each conversation are classified as Very Negative, Negative, Neutral, Positive, or Very Positive by Gemini-2.0-flash-001, using the following system prompt:

  You are a specialized tone classifier analyzing chatbot responses. You will be given a full chat log containing both user prompts and chatbot responses.  
  Your sole task is to classify the tone of the chatbot's responses, completely ignoring the user's messages and the inherent positivity or negativity of the conversation content itself. Instead, focus exclusively on the chatbot's style, language choice, and emotional expression.
  
  Output your classification of tone strictly in the following JSON format:
  
  {
    "tone": "very positive" | "positive" | "neutral" | "negative" | "very negative"
  }
  
  Tone Categories:
  - "very positive": Extremely enthusiastic, excited, highly encouraging, very cheerful.
  - "positive": Friendly, supportive, pleasant, slightly cheerful.
  - "neutral": Calm, factual, straightforward, objective, minimal emotion.
  - "negative": Slightly dismissive, mildly critical, frustrated, mildly negative.
  - "very negative": Strongly dismissive, clearly critical, angry, sarcastic, significantly negative.
  
  Important Guidelines:
  - Classify only the chatbot's responses.
  - Select exactly one tone category per conversation.
  - Ensure your output adheres precisely to the JSON schema provided.
  - Output the tone category in english
  
  Example output:
  {
    "tone": "very positive"
  }
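
The two new features can be derived per response roughly as follows. This is a minimal sketch, not the pipeline used for the leaderboard: the helper names (`count_emojis`, `parse_tone`, `one_hot_tone`) are hypothetical, the emoji matcher covers only the main pictograph blocks rather than the full Unicode emoji definition, and the classifier's reply is assumed to arrive as a string containing the JSON object described in the prompt.

```python
import json
import re

TONES = ["very negative", "negative", "neutral", "positive", "very positive"]

# Approximate emoji matcher (assumption): main pictograph blocks only,
# not the complete Unicode emoji specification.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def count_emojis(text: str) -> int:
    """Count characters that fall in the approximate emoji ranges above."""
    return len(EMOJI_RE.findall(text))


def parse_tone(raw_output: str) -> str:
    """Extract and validate the "tone" field from the classifier's reply."""
    # Tolerate surrounding text such as a code fence around the JSON object.
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in classifier output")
    tone = json.loads(raw_output[start:end + 1])["tone"].strip().lower()
    if tone not in TONES:
        raise ValueError(f"unexpected tone label: {tone!r}")
    return tone


def one_hot_tone(tone: str) -> dict:
    """Turn a tone label into five 0/1 indicator features."""
    return {t: int(t == tone) for t in TONES}
```

Validating the label against the five allowed categories keeps a malformed classifier reply from silently becoming a sixth category in the regression.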

Below are representative examples of each tone:

[Example tabs: Very Positive Sentiment | Positive Sentiment | Neutral Sentiment | Negative Sentiment | Very Negative Sentiment]

We fit a logistic regression model using these new features to isolate each model’s intrinsic quality from stylistic embellishments.
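
A fit of this kind can be sketched as follows, assuming a Bradley-Terry-style design in which each battle contributes +1 for model A, -1 for model B, and the per-battle difference in style features. The function names, the synthetic data, the single made-up style feature, and the plain gradient-descent solver are all illustrative assumptions, not the code behind the leaderboard.

```python
import numpy as np


def build_design_matrix(model_a, model_b, style_diff, n_models):
    """One row per battle: +1 for model A, -1 for model B, then style differences.

    Assumes model_a[i] != model_b[i] for every battle.
    """
    n = len(model_a)
    x_models = np.zeros((n, n_models))
    x_models[np.arange(n), model_a] = 1.0
    x_models[np.arange(n), model_b] = -1.0
    return np.hstack([x_models, style_diff])


def fit_logistic(X, y, l2=1e-3, lr=0.5, steps=2000):
    """Gradient descent on the L2-regularized mean logistic loss (no intercept)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    return w


# Synthetic demo: three models and one hypothetical style feature.
rng = np.random.default_rng(0)
n_models, n_battles = 3, 4000
true_w = np.array([0.5, 0.0, -0.5, 0.8])  # model strengths, then style effect
a = rng.integers(0, n_models, n_battles)
b = (a + rng.integers(1, n_models, n_battles)) % n_models  # guarantees b != a
style = rng.normal(size=(n_battles, 1))
X = build_design_matrix(a, b, style, n_models)
y = (rng.random(n_battles) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

w = fit_logistic(X, y)
strengths, style_coef = w[:n_models], w[n_models:]
```

The first `n_models` entries of `w` play the role of style-adjusted model strengths, and the remaining entries are the style coefficients. Because the style columns absorb whatever part of the outcome they explain, the model strengths reflect what is left once style is accounted for.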

Results

Controlling for style, sentiment, and emoji usage yields notable shifts in rankings. Primarily, models known for strong stylistic and emotional appeal, such as Grok-3 and Llama-4-Maverick-Experimental, drop in rank, while those with more neutral or subdued styles, such as Claude-3.7, rise noticeably.

Figure 1. Overall Chatbot Arena ranking vs Style and Sentiment Control ranking

Figure 2. Style Control ranking vs Style and Sentiment Control ranking

To illustrate the individual impact of each feature, we include the regression coefficients below:

Feature	Coefficient
Answer Length	0.2381
Markdown Header	0.0290
Markdown List	0.0201
Markdown Bold	0.0135
Emoji Count	-0.0039
Very Negative	-0.0034
Negative	-0.0428
Neutral	-0.0258
Positive	0.0146
Very Positive	0.0285
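
To read these numbers, recall that a logistic regression coefficient is a change in log-odds per one-unit difference in the feature. The sketch below converts each coefficient into an odds ratio and into the win probability of an otherwise evenly matched battle. The scale of "one unit" depends on how the features were normalized, which is not restated here, so treat the outputs as illustrative; the helper names are hypothetical.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "Answer Length": 0.2381,
    "Markdown Header": 0.0290,
    "Markdown List": 0.0201,
    "Markdown Bold": 0.0135,
    "Emoji Count": -0.0039,
    "Very Negative": -0.0034,
    "Negative": -0.0428,
    "Neutral": -0.0258,
    "Positive": 0.0146,
    "Very Positive": 0.0285,
}


def odds_ratio(coef: float) -> float:
    """Multiplicative change in the odds of winning per one-unit feature difference."""
    return math.exp(coef)


def win_prob_from_even(coef: float) -> float:
    """Win probability after a one-unit difference, starting from a 50/50 battle."""
    return 1.0 / (1.0 + math.exp(-coef))


for name, coef in coefficients.items():
    print(f"{name:16s} odds x{odds_ratio(coef):.3f}  p(win)={win_prob_from_even(coef):.3f}")
```

Under these assumptions, a one-unit advantage in Answer Length moves an even battle to roughly a 0.56 win probability, while every sentiment and emoji coefficient moves it by about one percentage point or less.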

Ablation Tests

To disentangle sentiment effects from other style cues, we ran an ablation study removing formatting features and retaining only emoji count and sentiment.

Feature	Coefficient
Emoji Count	-0.0048
Very Negative	0.0008
Negative	-0.0516
Neutral	-0.0463
Positive	0.0262
Very Positive	0.0419

Key observations:

Positive sentiment maintains a strong positive effect, even without formatting.
Neutral and Negative tones are penalized, highlighting a general preference for emotional expressiveness.

Figure 3. Overall ranking vs Sentiment Control ranking, where we only control for emoji count and sentiment

Further Analysis

To better understand how sentiment impacts head-to-head outcomes, we computed win rates conditioned on sentiment labels. Each entry below represents the probability that a model with a given tone (row) beats a model with another tone (column):

Tone (row) vs. tone (column)	Very Negative	Negative	Neutral	Positive	Very Positive
Very Negative	-	0.647845	0.539964	0.402597	0.430464
Negative	0.352155	-	0.360911	0.293156	0.217092
Neutral	0.460036	0.639089	-	0.414407	0.362715
Positive	0.597403	0.706844	0.585593	-	0.449768
Very Positive	0.569536	0.782908	0.637285	0.550232	-
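
A table like this can be computed from battle records along the following lines. This is a sketch under stated assumptions: each battle is a hypothetical `(tone_a, tone_b, winner)` tuple, and ties and same-tone battles are skipped, which may differ from how the table above was produced.

```python
from collections import defaultdict


def sentiment_win_rates(battles):
    """Return {(row_tone, col_tone): P(row beats col)}.

    battles: iterable of (tone_a, tone_b, winner), winner in {"a", "b", "tie"}.
    Ties and same-tone battles are skipped (an assumption of this sketch).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for tone_a, tone_b, winner in battles:
        if tone_a == tone_b or winner not in ("a", "b"):
            continue
        win_tone, lose_tone = (tone_a, tone_b) if winner == "a" else (tone_b, tone_a)
        wins[(win_tone, lose_tone)] += 1
        # Each counted battle adds to the denominator of both orderings.
        totals[(win_tone, lose_tone)] += 1
        totals[(lose_tone, win_tone)] += 1
    return {pair: wins[pair] / n for pair, n in totals.items()}


example = [
    ("positive", "neutral", "a"),
    ("neutral", "positive", "b"),
    ("positive", "neutral", "b"),
    ("neutral", "neutral", "a"),    # same tone: skipped
    ("positive", "neutral", "tie"),  # tie: skipped
]
rates = sentiment_win_rates(example)
```

With ties excluded, the two entries for any pair of tones sum to one, which matches the mirrored structure of the table.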

Several insights emerge:

Very Negative beats Negative (65%) and Neutral (54%), which might seem surprising at first. This likely reflects scenarios where users prompt the chatbot to behave maliciously or humorously at their own expense (e.g., “Roast me” or “Make fun of me”). In such cases, chatbots that lean into the negativity—rather than deflect—are actually rewarded by users.
Neutral tone underperforms across the board, losing to every other tone except Negative. This supports the idea that pronounced emotional expression tends to be preferred over dry or purely factual responses, although mildly Negative responses fare worst of all, losing to every other tone. Neutral responses may be perceived as disengaged or unhelpful, especially in creative or open-ended tasks.
As expected, Positive and Very Positive dominate, with Very Positive winning 78% of the time against Negative and 64% against Neutral.

This suggests that sentiment affects not only absolute rankings but also pairwise preferences in nuanced and sometimes counterintuitive ways.
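
These readings can be checked directly against the win-rate table. The short sketch below re-enters the table values, confirms that mirrored entries sum to one, and lists which tones each tone beats more than half the time; the helper names are hypothetical.

```python
tones = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]

# win[i][j] = probability that tone i (row) beats tone j (column); None on the diagonal.
win = [
    [None, 0.647845, 0.539964, 0.402597, 0.430464],
    [0.352155, None, 0.360911, 0.293156, 0.217092],
    [0.460036, 0.639089, None, 0.414407, 0.362715],
    [0.597403, 0.706844, 0.585593, None, 0.449768],
    [0.569536, 0.782908, 0.637285, 0.550232, None],
]


def beaten_by(tone):
    """Tones that the given tone beats more than half the time."""
    row = win[tones.index(tone)]
    return [tones[j] for j, p in enumerate(row) if p is not None and p > 0.5]


def max_asymmetry():
    """Largest deviation of win[i][j] + win[j][i] from 1 over all tone pairs."""
    n = len(tones)
    return max(abs(win[i][j] + win[j][i] - 1.0)
               for i in range(n) for j in range(i + 1, n))


for tone in tones:
    print(f"{tone} beats: {beaten_by(tone)}")
```

Running it shows that Negative beats no other tone, Neutral beats only Negative, and Very Positive beats all four others.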

Limitations and Future Directions

While Sentiment Control is an important advancement, our analysis remains observational. Unobserved confounders may still exist, such as intrinsic correlations between sentiment positivity and answer quality. Future work includes exploring other emotional and psychological dimensions of style.

We’re eager for community contributions and further collaboration!

Please see the link to the Colab notebook below. We will soon add Sentiment Control to all categories of the leaderboard, and we look forward to seeing how the community uses these new insights. Stay tuned for more updates!

Colab Link

Reference

[1] Li et al. “Does Style Matter? Disentangling style and substance in Chatbot Arena”

Citation

@misc{sentimentarena2025,
    title = {Introducing Sentiment Control: Disentangling Sentiment and Substance},
    url = {https://blog.lmarena.ai/blog/2025/sentiment-control/},
    author = {Connor Chen and Wei-Lin Chiang and Tianle Li and Anastasios N. Angelopoulos},
    month = {April},
    year = {2025}
}

@inproceedings{chiang2024chatbot,
  title={Chatbot arena: An open platform for evaluating llms by human preference},
  author={Chiang, Wei-Lin and Zheng, Lianmin and Sheng, Ying and Angelopoulos, Anastasios Nikolas and Li, Tianle and Li, Dacheng and Zhu, Banghua and Zhang, Hao and Jordan, Michael and Gonzalez, Joseph E and others},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024}
}