Prompt Arena compares how LLMs answer the same prompt. This page explains what the tool runs, how to read each view, and which caveats matter before you trust a result.
Synthesis
Best first stop. A reasoning model turns every response and every metric into a readable comparative report.
Consensus view
Shows which models converge, which ones drift, and whether a result is shared across providers.
Raw responses
The raw evidence. Use it whenever a score looks surprising or a claim needs to be checked.
Judge
For factual rubrics. Multiple judge models score the same responses and report agreement.
Lenses are intentionally narrow. Each one answers a specific question so you can move from overview to evidence without reading every response first. A sketch of two lenses follows the list.
Tokens
Which words dominate each model, and which words separate one model from the group.
N-grams
Repeated phrases and sentence fragments that reveal templates or memorized framing.
Entities
Names, dates, numbers, organizations, and other concrete claims surfaced in the answers.
Echo
How strongly each answer reuses the wording of the original prompt.
Hedging
How often the answer softens claims with markers like "may", "could", or "it depends".
Sentiment
Overall emotional tone, useful for prompts where warmth or caution changes the meaning.
Structure
Whether the model prefers bullets, prose, steps, caveats, summaries, or follow-up questions.
Signature
Opening styles, repeated endings, disclaimers, and provider-specific writing habits.
Drift
How the same model changes across repeated runs of the same prompt.
Diversity
Whether the answers explore different approaches or collapse into the same pattern.
Readability
How dense or accessible the output is. Useful when comparing answers for real users.
Consensus
How similar models are to one another, and where outliers appear.
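Under the hood, lenses like these are simple text statistics. As a minimal sketch of how the Hedging and Echo lenses could be computed: the marker list and the Jaccard overlap measure below are illustrative assumptions, not the tool's actual implementation.

```python
import re

# Illustrative hedge markers; the real lens lexicon is not specified here.
HEDGE_MARKERS = {"may", "might", "could", "perhaps", "possibly", "likely"}
HEDGE_PHRASES = ("it depends", "in some cases")

def _words(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def hedging_score(answer: str) -> float:
    """Hedge markers per 100 words (multi-word markers checked on raw text)."""
    words = _words(answer)
    if not words:
        return 0.0
    hits = sum(w in HEDGE_MARKERS for w in words)
    hits += sum(answer.lower().count(p) for p in HEDGE_PHRASES)
    return 100.0 * hits / len(words)

def echo_score(prompt: str, answer: str) -> float:
    """Vocabulary overlap (Jaccard) between prompt and answer; 0 means no reuse."""
    p, a = set(_words(prompt)), set(_words(answer))
    return len(p & a) / len(p | a) if p | a else 0.0
```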
A result is stronger when the raw responses, repeated runs, analysis lenses, and judge agreement all point in the same direction.
Every successful response is saved with model id, run index, provider version when available, token use, latency, and the original prompt.
This makes each study auditable. You can open a result, inspect the raw response, and compare it with the aggregate views without re-running the experiment.
Failed calls are also visible instead of being silently dropped. A failed model should change how you read a study, especially when the comparison depends on full provider coverage.
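One plausible shape for the saved record, as a minimal sketch: the field names are assumptions derived from the list above, not the tool's documented schema.

```python
from dataclasses import dataclass

@dataclass
class ResponseRecord:
    # Field names are illustrative, mirroring the list above.
    model_id: str
    run_index: int
    prompt: str
    text: str | None            # None when the call failed
    ok: bool                    # failed calls stay visible, not dropped
    provider_version: str | None = None
    tokens_in: int = 0
    tokens_out: int = 0
    latency_ms: float = 0.0
```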
A single answer can look convincing while hiding instability. Repetition reveals the model's behavior, not just its best sample.
If a model gives the same structure and conclusion every time, its behavior is stable for that prompt. If it changes tone, advice, facts, or format across runs, the drift views make that visible.
Repeated runs are also useful for prompt design: a better prompt should reduce unwanted variation while preserving useful diversity.
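As a rough illustration of a drift signal, repeated runs can be compared pairwise. Vocabulary overlap, used here, is a crude stand-in for the richer structure, fact, and tone comparisons the drift views describe.

```python
import re
from itertools import combinations

def _vocab(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower()))

def _jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def run_stability(runs: list[str]) -> float:
    """Mean pairwise vocabulary overlap across repeated runs of one model.
    Values near 1.0 suggest stable behavior; low values suggest drift."""
    if len(runs) < 2:
        return 1.0  # a single run cannot show drift
    pairs = list(combinations(runs, 2))
    return sum(_jaccard(_vocab(a), _vocab(b)) for a, b in pairs) / len(pairs)
```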
A reasoning model reads the full response set plus the computed analyses, then writes a comparative report.
The synthesis is not a replacement for the raw evidence. It is a guided reading layer that helps you find the important differences faster.
For large studies, use the model filter before generating a synthesis if you want the report to focus on a smaller subset of models.
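As a loose sketch of that workflow: the synthesis input is just the optionally filtered responses plus the lens summaries, concatenated for the reasoning model to read. Every name below is hypothetical.

```python
def build_synthesis_input(responses: dict[str, str], analyses: dict[str, str],
                          model_filter: set[str] | None = None) -> str:
    """Assemble what the reasoning model reads: filtered responses
    plus computed lens summaries. All names are illustrative."""
    if model_filter:
        responses = {m: t for m, t in responses.items() if m in model_filter}
    parts = ["Compare how these models answered the same prompt.", ""]
    for model, text in responses.items():
        parts += [f"--- {model} ---", text, ""]
    parts.append("Computed analyses:")
    parts += [f"- {lens}: {summary}" for lens, summary in analyses.items()]
    parts.append("Write a comparative report: convergence, outliers, caveats.")
    return "\n".join(parts)
```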
The Judge tab is for yes/no factual rubrics, not subjective taste.
A good rubric asks observable questions: "Did the answer state X?", "Did it cite Y?", "Did it avoid Z?" Binary questions reduce ambiguity and make agreement between judges easier to interpret.
Each judge produces structured scores. The dashboard reports the aggregate, the per-judge view, and agreement so you can tell whether the result is robust or contested.
High agreement means judges scored the same cells similarly. Low agreement means the rubric or the answers are contested.
The agreement metric is based on dispersion between judge scores. It should be read as a confidence signal, not as a quality score.
A model can rank highly while agreement is low. In that case, the result is interesting but not yet stable enough to treat as a clean finding.
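Since agreement is described as dispersion-based, a minimal sketch under that reading could look like this. It assumes binary per-cell scores; the exact normalization is an assumption.

```python
from statistics import mean, pstdev

def agreement(scores_by_judge: dict[str, list[float]]) -> float:
    """1 minus mean per-cell dispersion across judges, scaled to [0, 1].
    Assumes every judge scored the same rubric cells with binary values,
    where the maximum possible pstdev per cell is 0.5."""
    cells = list(zip(*scores_by_judge.values()))  # one tuple of judge scores per cell
    if not cells:
        return 1.0
    return 1.0 - mean(pstdev(cell) for cell in cells) / 0.5
```

For example, if one judge dissents on a single cell out of three, this sketch returns roughly 0.67, which reads as moderately contested.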
Provider models drift, token budgets vary, and automated analysis can miss context.
Model providers can update behavior behind the same public model id. Older studies are best treated as historical snapshots, not current benchmarks.
The tool measures the captured text. It does not know whether a model was objectively correct unless the study includes a rubric or you inspect the evidence yourself.
Short definitions for the labels that appear across studies and analysis tabs.
The Self-Bias Index and per-row Δ are diagnostic tools, not definitive verdicts. The conditions listed below produce noisy or uninterpretable values; when one applies to the winner, the dashboard surfaces a caveat in the verdict reading.
Treat Prompt Arena as an evidence browser, not a leaderboard. A good study should make it easy to explain why a model behaved differently, not just which model scored higher.