Methodology
Every day we run the exact same test suite against every tracked model and compare each model against its own recent history. A model is never compared to other models for the verdict — only to itself. Suite version: 2.0.0.
The daily tests
Ten use cases, roughly five test cases each. All model calls run at temperature 0 with reasoning effort pinned low, through OpenRouter with provider routing pinned per model — so neither a backend swap nor variable thinking budgets can quietly change what we’re measuring. When routing does change, we flag it on the model page instead of blaming the model.
Adaptive sampling: on a normal day each test case runs a small base sample, since outputs at temperature 0 are near-deterministic. If a model’s provisional score deviates suspiciously from its baseline, that (model, use case) pair is re-run at full sample count the same day. Verdicts are only ever computed from full-power evidence, so the cheap steady state never weakens a “cooked” call.
| Use case | Grading | Samples (base → on anomaly) |
|---|---|---|
| Structured Extraction | deterministic | 1 → 3 |
| Math & Reasoning | deterministic | 1 → 3 |
| Code Generation | deterministic | 1 → 3 |
| Instruction Following | deterministic | 1 → 3 |
| Report Analysis | deterministic | 1 → 3 |
| Summarization | hybrid | 3 → 5 |
| Customer Service | hybrid | 3 → 5 |
| Web Design | hybrid | 3 → 5 |
| Game Design | judge | 3 → 5 |
| Creative Writing | judge | 3 → 5 |
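The adaptive sampling step can be sketched roughly as follows. This is a minimal illustration, not the suite's actual code: the function name, the `run_sample` callback, and the `trigger_drop` value are hypothetical, since the text does not say how "deviates suspiciously" is measured for the provisional score. Only the sample counts come from the table.

```python
from statistics import mean
from typing import Callable

# (base, on anomaly) sample counts per grading mode, taken from the table.
SAMPLE_COUNTS = {"deterministic": (1, 3), "hybrid": (3, 5), "judge": (3, 5)}


def adaptive_score(
    run_sample: Callable[[], float],
    grading: str,
    baseline_mean: float,
    trigger_drop: float = 5.0,  # hypothetical trigger, in score points
) -> tuple[float, int, bool]:
    """Return (score, samples_used, full_power) for one (model, use case)."""
    base, full = SAMPLE_COUNTS[grading]
    provisional = mean(run_sample() for _ in range(base))
    if baseline_mean - provisional < trigger_drop:
        # Nothing suspicious: keep the cheap base sample.
        return provisional, base, False
    # Suspicious: re-run the same day at the full sample count.
    scores = [run_sample() for _ in range(full)]
    return mean(scores), full, True
```

The returned `full_power` flag is one way to make sure a degraded verdict is only ever computed from the full sample count.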
Grading
Wherever possible grading is deterministic: code must pass unit tests, extracted JSON is field-matched against gold data, math answers are exact-matched, format constraints are checked programmatically. Subjective use cases use a pinned, dated LLM judge (openai/gpt-4o-mini-2024-07-18) that grades against fixed rubrics. Before every run the judge re-grades a frozen calibration set; if its scoring shifts more than 0.3 on a 1–5 scale, judged use cases are barred from flipping verdicts that day — judge drift must never masquerade as model degradation.
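A sketch of the judge calibration gate, under one loud assumption: the text does not say how the "shift" is measured, so this example uses the mean absolute per-item difference between the frozen reference grades and today's re-grades. The function name is hypothetical; only the 0.3 threshold comes from the text.

```python
from statistics import mean

MAX_JUDGE_SHIFT = 0.3  # from the text, on the 1-5 rubric scale


def judge_is_stable(frozen: list[float], regraded: list[float]) -> bool:
    """True if judged use cases may flip verdicts today.

    Assumption: shift = mean absolute per-item difference between the
    frozen calibration grades and today's re-grades of the same items.
    """
    if not frozen or len(frozen) != len(regraded):
        raise ValueError("calibration sets must be non-empty and aligned")
    shift = mean([abs(a - b) for a, b in zip(frozen, regraded)])
    return shift <= MAX_JUDGE_SHIFT
```

When this returns `False`, judged scores could still be recorded, but they would be barred from changing any verdict that day.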
The verdict
Each model/use-case pair gets a daily score (0–100). The baseline is the trailing 14-day mean and standard deviation (minimum 5 days, else the model shows CALIBRATING). Standard deviation is floored at 2 points (deterministic) / 4 points (judged) so near-zero-variance histories don’t produce explosive z-scores.
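The baseline and floored z-score described above might look like this. It is a sketch under assumptions: the text does not say whether the standard deviation is the sample or population form (sample is used here), nor which floor applies to hybrid use cases, so the caller picks the floor key. The function name is hypothetical.

```python
from statistics import mean, stdev
from typing import Optional

WINDOW_DAYS = 14  # trailing baseline window
MIN_DAYS = 5      # fewer days than this means CALIBRATING
SD_FLOOR = {"deterministic": 2.0, "judged": 4.0}  # in score points


def z_score(history: list[float], today: float, kind: str) -> Optional[float]:
    """z-score of today's score against the trailing baseline.

    Returns None while the pair is still CALIBRATING.
    """
    window = history[-WINDOW_DAYS:]
    if len(window) < MIN_DAYS:
        return None
    mu = mean(window)
    # Floor the spread so a flat history cannot produce a huge z-score.
    sd = max(stdev(window), SD_FLOOR[kind])
    return (today - mu) / sd
```

For example, a deterministic use case that scored 90 on each of the last 14 days and scores 84 today has a raw standard deviation of 0; with the 2-point floor its z-score is -3 rather than undefined.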
- SUSPICIOUS — z ≤ -2 or a drop ≥ 10% vs baseline (judged use cases: -2.5 / 15%).
- COOKED — z ≤ -3 or a drop ≥ 20% sustained 2 consecutive days, or a single-day collapse ≥ 35% (judged: -3.5 / 25% / 40%).
- A model is COOKED when ≥ 2 use cases are cooked, or 1 cooked plus ≥ 2 suspicious.
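The rules above can be sketched as follows, with the caveat that the bullet wording leaves some room for interpretation. This example assumes that "sustained 2 consecutive days" applies to both the z-score and the drop condition, that the drop is a relative percentage against the baseline mean, and that a use case with no flag is labelled `"OK"` (a hypothetical label). All names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

Day = tuple[float, float]  # (z-score, drop vs baseline in percent)


@dataclass(frozen=True)
class Thresholds:
    susp_z: float
    susp_drop: float
    cooked_z: float
    cooked_drop: float
    collapse_drop: float


DETERMINISTIC = Thresholds(-2.0, 10.0, -3.0, 20.0, 35.0)
JUDGED = Thresholds(-2.5, 15.0, -3.5, 25.0, 40.0)


def use_case_verdict(today: Day, yesterday: Optional[Day], t: Thresholds) -> str:
    def severe(day: Optional[Day]) -> bool:
        return day is not None and (day[0] <= t.cooked_z or day[1] >= t.cooked_drop)

    z, drop = today
    # Single-day collapse, or a severe breach on two consecutive days.
    if drop >= t.collapse_drop or (severe(today) and severe(yesterday)):
        return "COOKED"
    if z <= t.susp_z or drop >= t.susp_drop:
        return "SUSPICIOUS"
    return "OK"  # hypothetical label for "no flag"


def model_is_cooked(verdicts: list[str]) -> bool:
    cooked = verdicts.count("COOKED")
    suspicious = verdicts.count("SUSPICIOUS")
    return cooked >= 2 or (cooked >= 1 and suspicious >= 2)
```

Under this reading, a first-day z-score of -3.2 is only SUSPICIOUS; it becomes COOKED if the breach repeats the next day.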
Days where more than 20% of a model’s samples errored produce a gap, never a zero — an API outage is not degradation. Any change to prompts or graders bumps the suite version and resets all baselines: scores are never compared across suite versions.
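A minimal sketch of the gap rule, assuming errored samples are represented as `None` and that surviving samples are simply averaged (neither detail is stated in the text). The function name is hypothetical; the 20% limit is from the text.

```python
from typing import Optional

MAX_ERROR_RATE = 0.20  # from the text


def daily_score(samples: list[Optional[float]]) -> Optional[float]:
    """Average the day's samples, or return None to record a gap.

    None entries stand for samples that errored (for example, API failures).
    """
    if not samples:
        return None
    errored = sum(s is None for s in samples)
    if errored / len(samples) > MAX_ERROR_RATE:
        return None  # a gap, never a zero
    ok = [s for s in samples if s is not None]
    return sum(ok) / len(ok)
```

Returning `None` rather than `0.0` keeps an outage out of the trailing baseline instead of dragging the mean down.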
The benchmark score
The leaderboard’s “bench score” is a separate, mostly-static number: a weighted blend of public results (LMArena Elo 30%, MMLU-Pro 20%, GPQA Diamond 20%, SWE-bench Verified 20%, AIME 10%), refreshed manually from the cited sources. It gives context on how strong a model is; the daily tests tell you whether it still is.
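The weighted blend could be computed along these lines. The weights come from the text; everything else is an assumption. In particular, the text does not say how an Elo rating is put on the same scale as percentage benchmarks, so this sketch expects every input to be already normalized to 0–100, and the dictionary keys are hypothetical names.

```python
# Weights from the text; key names are illustrative.
WEIGHTS = {
    "lmarena_elo": 0.30,
    "mmlu_pro": 0.20,
    "gpqa_diamond": 0.20,
    "swe_bench_verified": 0.20,
    "aime": 0.10,
}


def bench_score(normalized: dict[str, float]) -> float:
    """Weighted blend of public results, each pre-normalized to 0-100."""
    missing = WEIGHTS.keys() - normalized.keys()
    if missing:
        # How missing results are handled is not stated; fail loudly here.
        raise KeyError(f"missing benchmark results: {sorted(missing)}")
    return sum(weight * normalized[name] for name, weight in WEIGHTS.items())
```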
All thresholds are constants in the open-source suite manifest and will be tuned as real-world variance data accumulates. Every raw sample, grade breakdown, and serving provider is published on the model pages.