Methodology

Every day we run the exact same test suite against every tracked model and compare each model against its own recent history. A model is never compared to other models for the verdict — only to itself. Suite version: 2.0.0.

The daily tests

Ten use cases, roughly five test cases each. All model calls run at temperature 0 with reasoning effort pinned low, through OpenRouter with provider routing pinned per model — so neither a backend swap nor variable thinking budgets can quietly change what we’re measuring. When routing does change, we flag it on the model page instead of blaming the model.

Adaptive sampling: on a normal day each test case runs a small base sample (outputs at temperature 0 are near-deterministic). If a model’s provisional score deviates suspiciously from its baseline, that (model, use case) pair is immediately re-run at full sample count the same day. Verdicts are only ever computed from full-power evidence, so the cheap steady state never weakens a “cooked” call.

Use case	Grading	Samples (base → on anomaly)
Structured Extraction	deterministic	1 → 3
Math & Reasoning	deterministic	1 → 3
Code Generation	deterministic	1 → 3
Instruction Following	deterministic	1 → 3
Report Analysis	deterministic	1 → 3
Summarization	hybrid	3 → 5
Customer Service	hybrid	3 → 5
Web Design	hybrid	3 → 5
Game Design	judge	3 → 5
Creative Writing	judge	3 → 5
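
The adaptive sampling flow described above the table can be sketched roughly as follows. This is an illustration, not the suite's actual code: `run_sample` and `looks_anomalous` are hypothetical callables (the text does not say how "deviates suspiciously" is computed), and drawing a fresh full sample rather than topping up the base sample is an assumption.

```python
from statistics import mean
from typing import Callable, Tuple

# (base, full) sample counts per grading type, taken from the table above.
SAMPLE_COUNTS = {"deterministic": (1, 3), "hybrid": (3, 5), "judge": (3, 5)}


def score_with_adaptive_sampling(
    run_sample: Callable[[], float],          # hypothetical: runs one sample, returns a 0-100 score
    grading: str,                             # "deterministic", "hybrid" or "judge"
    looks_anomalous: Callable[[float], bool]  # hypothetical: provisional-score check
) -> Tuple[float, int, bool]:
    """Return (score, samples_used_for_score, escalated)."""
    base_n, full_n = SAMPLE_COUNTS[grading]

    # Cheap steady state: a small base sample.
    provisional = mean(run_sample() for _ in range(base_n))
    if not looks_anomalous(provisional):
        return provisional, base_n, False

    # Anomaly: re-run the same day at full sample count (assumed to be a fresh draw).
    full = mean(run_sample() for _ in range(full_n))
    return full, full_n, True
```

In this sketch only the escalated result would feed a verdict; the base-sample score is just the day's routine data point.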

Grading

Wherever possible grading is deterministic: code must pass unit tests, extracted JSON is field-matched against gold data, math answers are exact-matched, format constraints are checked programmatically. Subjective use cases use a pinned, dated LLM judge (openai/gpt-4o-mini-2024-07-18) that grades against fixed rubrics. Before every run the judge re-grades a frozen calibration set; if its scoring shifts more than 0.3 on a 1–5 scale, judged use cases are barred from flipping verdicts that day — judge drift must never masquerade as model degradation.
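
A minimal sketch of the judge calibration gate, under stated assumptions: the text gives the 0.3 threshold on a 1–5 scale but not how the shift is measured, so this example assumes it is the absolute difference between the mean grade on the frozen set today and the mean of the reference grades. The function name is hypothetical.

```python
from statistics import mean
from typing import Sequence

MAX_JUDGE_SHIFT = 0.3  # on the 1-5 rubric scale, from the text


def judge_has_drifted(
    reference_grades: Sequence[float],
    todays_grades: Sequence[float],
    max_shift: float = MAX_JUDGE_SHIFT,
) -> bool:
    """True if the judge's grades on the frozen calibration set moved too far.

    Assumption: "shift" means the absolute change in the mean grade.
    """
    if len(reference_grades) != len(todays_grades):
        raise ValueError("calibration set must be graded in full")
    shift = abs(mean(todays_grades) - mean(reference_grades))
    return shift > max_shift
```

When this returns True, judged use cases would be barred from flipping verdicts for the day.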

The verdict

Each model/use-case pair gets a daily score (0–100). The baseline is the trailing 14-day mean and standard deviation (minimum 5 days, else the model shows CALIBRATING). Standard deviation is floored at 2 points (deterministic) / 4 (judged) so near-zero-variance histories don’t produce explosive z-scores.
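
As a rough illustration of the baseline and floored z-score, assuming the sample standard deviation is used (the text does not say sample or population) and that `history` excludes today's score. Which floor applies to hybrid use cases is not stated above, so the sketch covers only the two named kinds.

```python
from statistics import mean, stdev
from typing import Optional, Sequence

WINDOW_DAYS = 14
MIN_BASELINE_DAYS = 5
STD_FLOOR = {"deterministic": 2.0, "judged": 4.0}  # points, from the text


def z_score(history: Sequence[float], today: float, kind: str) -> Optional[float]:
    """z-score of today's score against the trailing baseline.

    Returns None while the pair is still CALIBRATING (fewer than 5 days).
    """
    window = list(history)[-WINDOW_DAYS:]
    if len(window) < MIN_BASELINE_DAYS:
        return None
    baseline_mean = mean(window)
    # Floor the spread so a flat history cannot produce a huge z-score.
    spread = max(stdev(window), STD_FLOOR[kind])
    return (today - baseline_mean) / spread
```

For example, six straight days at 90 followed by an 84 gives a raw standard deviation of 0; with the 2-point floor the z-score is (84 - 90) / 2 = -3.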

SUSPICIOUS — z ≤ -2 or a drop ≥ 10% vs baseline (judged use cases: -2.5 / 15%).
COOKED — z ≤ -3 or a drop ≥ 20% sustained 2 consecutive days, or a single-day collapse ≥ 35% (judged: -3.5 / 25% / 40%).
A model is COOKED when ≥ 2 use cases are cooked, or 1 cooked plus ≥ 2 suspicious.
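
The rules above can be sketched as follows. The thresholds are from the text, but the reading of the COOKED rule is an interpretation: this sketch assumes "sustained 2 consecutive days" applies to both the z-score and the drop condition, and that a drop is measured as a fraction of the baseline mean. The "OK" label and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Thresholds:
    suspicious_z: float
    suspicious_drop: float
    cooked_z: float
    cooked_drop: float
    collapse_drop: float


DETERMINISTIC = Thresholds(-2.0, 0.10, -3.0, 0.20, 0.35)
JUDGED = Thresholds(-2.5, 0.15, -3.5, 0.25, 0.40)

Day = Tuple[float, float]  # (z-score, fractional drop vs baseline)


def use_case_verdict(today: Day, yesterday: Optional[Day], t: Thresholds) -> str:
    def severe(day: Day) -> bool:
        z, drop = day
        return z <= t.cooked_z or drop >= t.cooked_drop

    z, drop = today
    if drop >= t.collapse_drop:
        return "COOKED"  # single-day collapse
    if yesterday is not None and severe(today) and severe(yesterday):
        return "COOKED"  # severe on two consecutive days (assumed reading)
    if z <= t.suspicious_z or drop >= t.suspicious_drop:
        return "SUSPICIOUS"
    return "OK"  # hypothetical label for "nothing flagged"


def model_is_cooked(use_case_verdicts: Iterable[str]) -> bool:
    verdicts = list(use_case_verdicts)
    cooked = verdicts.count("COOKED")
    suspicious = verdicts.count("SUSPICIOUS")
    return cooked >= 2 or (cooked >= 1 and suspicious >= 2)
```

Under this reading, a single severe day without a full collapse only reaches SUSPICIOUS; it takes a second consecutive severe day to reach COOKED.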

Days where more than 20% of a model’s samples errored produce a gap, never a zero — an API outage is not degradation. Any change to prompts or graders bumps the suite version and resets all baselines: scores are never compared across suite versions.
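
A small sketch of the gap rule, assuming errored samples are represented as `None` and that the day's score is the plain mean of the samples that succeeded (the aggregation is not specified above). The function name is hypothetical.

```python
from typing import Optional, Sequence

MAX_ERROR_RATE = 0.20  # from the text


def daily_score_or_gap(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Return the day's score, or None to record a gap.

    A gap is recorded instead of a zero when too many samples errored,
    so an API outage never looks like a quality drop.
    """
    if not samples:
        return None
    errored = sum(s is None for s in samples)
    if errored / len(samples) > MAX_ERROR_RATE:
        return None
    succeeded = [s for s in samples if s is not None]
    return sum(succeeded) / len(succeeded)
```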

The benchmark score

The leaderboard’s “bench score” is a separate, mostly-static number: a weighted blend of public results (LMArena Elo 30%, MMLU-Pro 20%, GPQA Diamond 20%, SWE-bench Verified 20%, AIME 10%), refreshed manually from the cited sources. It gives context on how strong a model is; the daily tests tell you whether it still is.
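
The weighted blend can be illustrated as below. The weights are from the text; how each source is normalized is not stated (an Elo rating is not on a 0–100 scale), so this sketch assumes every input has already been mapped to 0–100. The dictionary keys are hypothetical identifiers.

```python
from typing import Mapping

# Weights from the text; keys are illustrative names, not the manifest's.
BENCH_WEIGHTS = {
    "lmarena_elo": 0.30,
    "mmlu_pro": 0.20,
    "gpqa_diamond": 0.20,
    "swe_bench_verified": 0.20,
    "aime": 0.10,
}


def bench_score(normalized: Mapping[str, float]) -> float:
    """Weighted blend of public results, each assumed pre-normalized to 0-100."""
    missing = BENCH_WEIGHTS.keys() - normalized.keys()
    if missing:
        raise KeyError(f"missing benchmark results: {sorted(missing)}")
    return sum(weight * normalized[name] for name, weight in BENCH_WEIGHTS.items())
```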

All thresholds are constants in the open-source suite manifest and will be tuned as real-world variance data accumulates. Every raw sample, grade breakdown, and serving provider is published on the model pages.