Prompt Arena
Method guide

Understand a model study without reading a paper.

Prompt Arena compares how LLMs answer the same prompt. This page explains what the tool runs, how to read each view, and which caveats matter before you trust a result.

Fast path

Start with these views

The rest is available when you need evidence.

Synthesis

Best first stop. A reasoning model turns every response and every metric into a readable comparative report.

Consensus

Shows which models converge, which ones drift, and whether a result is shared across providers.

Responses

The raw evidence. Use it whenever a score looks surprising or a claim needs to be checked.

Judge

For factual rubrics. Multiple judge models score the same responses and report agreement.

Analysis lenses

What each lens answers

Lenses are intentionally narrow. Each one answers a specific question so you can move from overview to evidence without reading every response first.

Content

Tokens

Which words dominate each model, and which words separate one model from the group.

N-grams

Repeated phrases and sentence fragments that reveal templates or memorized framing.

Entities

Names, dates, numbers, organizations, and other concrete claims surfaced in the answers.

Echo

How strongly each answer reuses the wording of the original prompt.

Behavior

Hedging

How often the answer softens claims with markers like "may", "could", or "it depends". A minimal counting sketch appears after this catalog.

Sentiment

Overall emotional tone, useful for prompts where warmth or caution changes the meaning.

Structure

Whether the model prefers bullets, prose, steps, caveats, summaries, or follow-up questions.

Signature

Opening styles, repeated endings, disclaimers, and provider-specific writing habits.

Reliability

Drift

How the same model changes across repeated runs of the same prompt.

Diversity

Whether the answers explore different approaches or collapse into the same pattern.

Readability

How dense or accessible the output is. Useful when comparing answers for real users.

Consensus

How similar models are to one another, and where outliers appear.
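
If you want a concrete feel for what a lens computes, the sketch below shows one plausible way the Hedging lens could count softening markers per 100 words. The marker list and the normalization are illustrative assumptions, not the tool's actual implementation.

    import re

    # Illustrative marker list; the real lens may use a different vocabulary.
    HEDGE_MARKERS = ("may", "might", "could", "perhaps", "it depends", "likely")

    def hedging_score(response: str) -> float:
        """Occurrences of hedge markers per 100 words (sketch, not the dashboard's metric)."""
        words = response.split()
        if not words:
            return 0.0
        hits = sum(
            len(re.findall(rf"\b{re.escape(marker)}\b", response, flags=re.IGNORECASE))
            for marker in HEDGE_MARKERS
        )
        return 100.0 * hits / len(words)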

Method

How to trust a result

A result is stronger when the raw responses, repeated runs, analysis lenses, and judge agreement all point in the same direction.

What Prompt Arena stores

Every successful response is saved with model id, run index, provider version when available, token use, latency, and the original prompt.

This makes each study auditable. You can open a result, inspect the raw response, and compare it with the aggregate views without re-running the experiment.

Failed calls are also visible instead of being silently dropped. A failed model should change how you read a study, especially when the comparison depends on full provider coverage.
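
As a rough mental model, each stored run can be pictured as a small record like the one below. The field names are illustrative assumptions drawn from the list above, not the tool's actual schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RunRecord:
        # Illustrative fields only; the stored schema may differ.
        study_id: str
        model_id: str                            # provider model id
        run_index: int                           # position among repeated runs
        prompt: str                              # the original study prompt
        response: Optional[str] = None           # None when the call failed
        provider_version: Optional[str] = None   # recorded when the provider reports one
        prompt_tokens: Optional[int] = None
        completion_tokens: Optional[int] = None
        latency_ms: Optional[float] = None
        error: Optional[str] = None              # failed calls stay visible, not dropped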

Why repeated runs matter

A single answer can look convincing while hiding instability. Repetition reveals the model's behavior, not just its best sample.

If a model gives the same structure and conclusion every time, its behavior is stable for that prompt. If it changes tone, advice, facts, or format across runs, the drift views make that visible.

Repeated runs are also useful for prompt design: a better prompt should reduce unwanted variation while preserving useful diversity.
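
One plausible way to quantify that variation is to compare each run against the others from the same model. The word-level Jaccard similarity and the 1 − mean-similarity formula below are assumptions for illustration; the Drift lens may be computed differently.

    from itertools import combinations

    def jaccard(a: str, b: str) -> float:
        """Word-level overlap between two responses (illustrative similarity measure)."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        if not wa and not wb:
            return 1.0
        return len(wa & wb) / len(wa | wb)

    def drift(runs: list[str]) -> float:
        """Higher means the model changed more across repeated runs of the same prompt."""
        if len(runs) < 2:
            return 0.0
        sims = [jaccard(a, b) for a, b in combinations(runs, 2)]
        return 1.0 - sum(sims) / len(sims)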

How the AI synthesis works

A reasoning model reads the full response set plus the computed analyses, then writes a comparative report.

The synthesis is not a replacement for the raw evidence. It is a guided reading layer that helps you find the important differences faster.

For large studies, use the model filter before generating a synthesis if you want the report to focus on a smaller subset of models.

How binary judging works

The Judge tab is for yes/no factual rubrics, not subjective taste.

A good rubric asks observable questions: "Did the answer state X?", "Did it cite Y?", "Did it avoid Z?" Binary questions reduce ambiguity and make agreement between judges easier to interpret.

Each judge produces structured scores. The dashboard reports the aggregate, the per-judge view, and agreement so you can tell whether the result is robust or contested.
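
As a sketch, a rubric can be a list of yes/no questions and each judge's output a row of 0/1 scores per question. The names and structure below are assumptions for illustration, not the dashboard's data format.

    # Illustrative rubric of observable, binary questions.
    rubric = [
        "Did the answer state the year of the event?",
        "Did it cite at least one source?",
        "Did it avoid giving medical advice?",
    ]

    # One model's response scored by two hypothetical judges: 1 = yes, 0 = no.
    judge_scores = {
        "judge-a": [1, 0, 1],
        "judge-b": [1, 1, 1],
    }

    def aggregate(scores: dict[str, list[int]]) -> list[float]:
        """Per-criterion mean across judges (the aggregate view)."""
        n_criteria = len(next(iter(scores.values())))
        return [
            sum(row[i] for row in scores.values()) / len(scores)
            for i in range(n_criteria)
        ]

    print(aggregate(judge_scores))  # [1.0, 0.5, 1.0]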

How agreement is read

High agreement means judges scored the same cells similarly. Low agreement means the rubric or the answers are contested.

The agreement metric is based on dispersion between judge scores. It should be read as a confidence signal, not as a quality score.

A model can rank highly while agreement is low. In that case, the result is interesting but not yet stable enough to treat as a clean finding.
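
A minimal sketch of a dispersion-based agreement signal, assuming scores on a 0–1 scale: the smaller the spread between judge scores on a cell, the higher the agreement. The exact formula is not documented here, so treat this as one plausible reading rather than the dashboard's implementation.

    from statistics import pstdev

    def agreement(judge_scores: list[float]) -> float:
        """1 minus normalized dispersion: 1.0 = judges fully agree, 0.0 = maximal spread.
        On a 0-1 scale the population std dev is at most 0.5, hence the factor of 2.
        Illustrative only; the dashboard's metric may differ."""
        if len(judge_scores) < 2:
            return 1.0
        return 1.0 - min(1.0, 2.0 * pstdev(judge_scores))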

Known limits

Provider models drift, token budgets vary, and automated analysis can miss context.

Model providers can update behavior behind the same public model id. Older studies are best treated as historical snapshots, not current benchmarks.

The tool measures the captured text. It does not know whether a model was objectively correct unless the study includes a rubric or you inspect the evidence yourself.

Glossary

Terms used in the dashboard

Short definitions for the labels that appear across studies and analysis tabs.

Study
One prompt, one model set, one run configuration, and all resulting responses and analyses.
Run
One call to one model with the study prompt. Multiple runs per model reveal stability.
Lens
A focused analysis view that answers one question about the response set.
Drift
Variation across repeated runs from the same model.
Consensus
Similarity between models or runs. High consensus means answers converge.
Rubric
A set of scoring criteria used by judge models in binary analysis.
Judge
A model asked to score responses against a rubric. Judges are useful only when the rubric is observable.
Pollution
A judge output that is incomplete, constant, malformed, or too low-coverage to trust in the aggregate.
Family
The provider behind a model: openai (gpt-*, o*), anthropic (claude-*), google (gemini-*), xai (grok-*), deepseek (deepseek-*), perplexity (sonar*).
Open vs blind
Two parallel scoring passes per cell. Open: the judge sees the model name. Blind: the judge sees a randomized respondent_X label seeded per (study, judge). Anonymization applies only to the scoring prompt; the judge's own conclusion is computed once, on the open row, and references real names.
Δ open−blind (per-row)
A model's mean score in open mode minus its mean score in blind mode, averaged across judges. Descriptive: it tells you that the score moves with identity exposure. NOT causal: a positive Δ can come from favoritism toward the model, suppression of other models, or panel-wide drift. Use the Self-Bias Index to attribute the cause.
Self-Bias Index (SBI)
For each (judge J, criterion c): Δ_self = mean over models in family(J) of (open − blind); Δ_other = mean over models NOT in family(J) of (open − blind); SBI(J,c) = Δ_self − Δ_other. Panel SBI is the mean across cells. A 95 % bootstrap CI (1,000 iterations) tests significance: a value is starred (★) when the CI excludes 0. Positive ★ = consistent with intra-family favoritism; negative ★ = consistent with self-criticism. Other mechanisms (anti-other suppression, identity-triggered scrutiny on any familiar name) can produce the same patterns and the metric does not distinguish them. A computational sketch follows this glossary.
identity ↑ (chip)
Per-row reading. The score moves up when the judge sees the model name (Δ > 0.05). Direction is observed; cause requires SBI.
identity ↓ (chip)
Per-row reading. The score moves up when the model name is hidden (Δ < −0.05). Could reflect a corrected negative prior under blinding, or panel-wide leniency under anonymity.
stable (chip)
|Δ| < 0.05. Identity exposure has no detectable effect at this threshold.
top consensus (chip)
≥70 % of judges placed this model in their personal top 3 AND the score is stable across modes. The lead is not driven by identity exposure.
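
The Δ and SBI entries compress a lot of arithmetic into two definitions. The sketch below makes that arithmetic explicit for a single criterion, assuming a nested dict of per-mode mean scores; it is a reading aid, not the dashboard's implementation, and it omits the bootstrap CI and the averaging of cells into a panel SBI.

    # Assumed shape: scores[judge][model] = {"open": mean_open, "blind": mean_blind},
    # for one criterion. `family` maps a model or judge id to its provider family.

    def delta_open_blind(scores: dict, model: str) -> float:
        """Per-row Δ: open mean minus blind mean for one model, averaged across judges."""
        deltas = [rows[model]["open"] - rows[model]["blind"] for rows in scores.values()]
        return sum(deltas) / len(deltas)

    def sbi(scores: dict, judge: str, family: dict[str, str]) -> float:
        """SBI(judge, criterion) = Δ_self − Δ_other for one judge on one criterion."""
        rows = scores[judge]

        def mean_delta(models: list[str]) -> float:
            return sum(rows[m]["open"] - rows[m]["blind"] for m in models) / len(models)

        own = [m for m in rows if family[m] == family[judge]]
        other = [m for m in rows if family[m] != family[judge]]
        # If the judge's family contributes no scored models, Δ_self is undefined
        # (division by zero here); that degenerate case is discussed in the limits below.
        return mean_delta(own) - mean_delta(other)

    def chip(delta: float) -> str:
        """Per-row chip label from Δ, using the fixed 0.05 threshold."""
        if delta > 0.05:
            return "identity ↑"
        if delta < -0.05:
            return "identity ↓"
        return "stable"
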
Limits

When the bias metrics are unreliable

The Self-Bias Index and per-row Δ are diagnostic tools, not definitive verdicts. The conditions below produce noisy or uninterpretable values; the dashboard surfaces a caveat in the verdict reading when one of them applies to the winner.

Sample size of judges
A panel of fewer than ~5 judges leaves the SBI bootstrap CI wide and dominated by individual-judge quirks. Caveats are surfaced in the verdict reading when the panel is small.
Single-model families
When a judge's family contributes only one scored model, Δ_self is computed on that single model and is dominated by run-level noise rather than systematic family bias.
Missing-judge families
A family with scored models but no judge in the panel cannot be measured for favoritism by SBI. The verdict surfaces a "no judge from {family} sat on this panel" warning when the winner falls in such a family.
Threshold of 0.05 for chips
The "stable" cutoff is a fixed value on the 0–1 binary scale, not a statistical test. A Δ slightly above 0.05 with a wide CI may be noise; a Δ slightly below with a tight CI may be real. Read the chip alongside the SBI starring.
Multi-comparison
Per-cell ★ stars each use a single 95 % CI test. With J judges × C criteria there are J × C cells, so roughly 5 % of the stars are expected by chance. No FDR correction is applied; read individual cell stars as suggestive and the panel SBI as the headline.
Rule of thumb

Treat Prompt Arena as an evidence browser, not a leaderboard. A good study should make it easy to explain why a model behaved differently, not just which model scored higher.

Run a comparison