Benchmark Confidence & Contamination Flags

Not all benchmark scores are equally trustworthy. BenchLM now separates verified ranking from provisional ranking while still tracking the provenance of every stored score. The confidence indicator (1-4 dots) shows how much sourced benchmark coverage supports each model's score.

●●●● High: 7+ categories, 20+ non-generated benchmarks

●●●○ Good: 5+ categories, 12+ non-generated benchmarks

●●○○ Moderate: 3+ categories, 8+ non-generated benchmarks

●○○○ Low / Estimated: limited sourced data, score is estimated
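
The tier thresholds above can be read as a simple lookup. The sketch below is one possible reading, not BenchLM's actual code: it assumes both the category count and the non-generated benchmark count must meet a tier's minimum, and that the highest matching tier wins. The names `TIERS`, `confidence_tier`, and `render_dots` are hypothetical, and the site may apply further rules that are not described on this page.

```python
# Thresholds copied from the legend:
# (label, dots, minimum categories, minimum non-generated benchmarks)
TIERS = [
    ("High", 4, 7, 20),
    ("Good", 3, 5, 12),
    ("Moderate", 2, 3, 8),
]


def confidence_tier(categories: int, non_generated: int) -> tuple[str, int]:
    """Return (label, dots) for the highest tier whose two minimums are both met."""
    for label, dots, min_categories, min_benchmarks in TIERS:
        if categories >= min_categories and non_generated >= min_benchmarks:
            return label, dots
    # Assumption: anything below the Moderate thresholds falls through to the lowest tier.
    return "Low / Estimated", 1


def render_dots(dots: int) -> str:
    """Render the 1-4 dot indicator, e.g. 3 -> '●●●○'."""
    return "●" * dots + "○" * (4 - dots)


if __name__ == "__main__":
    label, dots = confidence_tier(categories=5, non_generated=40)
    print(render_dots(dots), label)
```

Under this reading, a model with many benchmarks concentrated in few categories stays in a lower tier, since the category minimum is checked independently of the row count.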

Confidence Distribution (Ranked Models)

High (4%)

Good (10%)

Moderate (11%)

Low / Estimated (75%, 150 models)

How BenchLM Scores Work

Verified, provisional, and generated

Each benchmark value is tagged as manual (a hand-entered public row) or generated (inferred from related models). Generated rows are excluded from all public ranking logic. Manual rows are now split again into sourced rows for the verified leaderboard and source-unverified rows that can still appear in provisional mode.
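
A minimal sketch of how this tagging could map rows onto the two leaderboards. The `BenchmarkRow` type, its field names, and both helper functions are hypothetical, and the example rows are made-up illustrations. The sketch also simplifies by treating every manual row as rankable, whereas BenchLM may filter manual rows further before ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    benchmark: str
    score: float
    origin: str            # "manual" (hand-entered public row) or "generated" (inferred)
    sourced: bool = False  # True when a manual row has an exact source


def provisional_rows(rows: list[BenchmarkRow]) -> list[BenchmarkRow]:
    """Rows usable in provisional mode: every manual row, sourced or not."""
    return [r for r in rows if r.origin == "manual"]


def verified_rows(rows: list[BenchmarkRow]) -> list[BenchmarkRow]:
    """Rows usable on the verified leaderboard: manual rows with a source."""
    return [r for r in rows if r.origin == "manual" and r.sourced]


if __name__ == "__main__":
    # Illustrative rows only; names and scores are placeholders.
    rows = [
        BenchmarkRow("bench_a", 71.0, "manual", sourced=True),
        BenchmarkRow("bench_b", 64.5, "manual", sourced=False),
        BenchmarkRow("bench_c", 58.0, "generated"),
    ]
    print(len(provisional_rows(rows)), len(verified_rows(rows)))
```

Generated rows appear in neither list, which matches the statement that they are excluded from all public ranking logic.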

Ranking Eligibility

A model must have at least 8 qualifying benchmarks across 2+ categories to rank in a lane. The provisional leaderboard uses rankable non-generated rows; the verified leaderboard uses sourced rows only. Models below the threshold are shown as tracked but unranked.
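
The threshold can be expressed as a small check, sketched here under the assumption that "qualifying benchmarks" are counted as distinct benchmark names and that each benchmark belongs to one category. The function name and the (benchmark, category) pair representation are hypothetical, and the benchmark and category names in the example are placeholders.

```python
def is_rank_eligible(
    qualifying: list[tuple[str, str]],
    min_benchmarks: int = 8,
    min_categories: int = 2,
) -> bool:
    """Check the lane threshold for one model.

    `qualifying` holds (benchmark, category) pairs for the rows that count
    in the lane being evaluated (provisional or verified).
    """
    benchmarks = {benchmark for benchmark, _ in qualifying}
    categories = {category for _, category in qualifying}
    return len(benchmarks) >= min_benchmarks and len(categories) >= min_categories


if __name__ == "__main__":
    # Eight placeholder benchmarks split evenly over two placeholder categories.
    rows = [(f"bench_{i}", "cat_a" if i < 4 else "cat_b") for i in range(8)]
    print(is_rank_eligible(rows))
```

Running the same check once with the provisional rows and once with the sourced rows gives a model's status in each lane. A model that fails both would be tracked but unranked.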

Category Eligibility

For category leaderboards, a model needs qualifying scores on at least half of the weighted benchmarks in that category. BenchLM computes this separately for provisional and verified ranking so sparse exact-source coverage cannot silently borrow strength from provisional rows.
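
As a rough sketch of the category rule, assuming "half of the weighted benchmarks" means half by count (not by weight) and that a category with no weighted benchmarks never qualifies. The function name and the placeholder benchmark names are hypothetical.

```python
def is_category_eligible(weighted_benchmarks: set[str], scored: set[str]) -> bool:
    """True when the model has qualifying scores on at least half of the
    category's weighted benchmarks.

    `scored` is the set of benchmarks with a qualifying score in the lane
    being evaluated; scores on other benchmarks are ignored.
    """
    if not weighted_benchmarks:
        return False
    covered = weighted_benchmarks & scored
    # Integer comparison avoids rounding issues with odd-sized categories.
    return 2 * len(covered) >= len(weighted_benchmarks)


if __name__ == "__main__":
    category = {"b1", "b2", "b3", "b4", "b5"}   # placeholder benchmark names
    provisional_scored = {"b1", "b2", "b3"}     # non-generated rows
    verified_scored = {"b1", "b2"}              # sourced rows only
    print(is_category_eligible(category, provisional_scored))
    print(is_category_eligible(category, verified_scored))
```

Calling the function separately with the provisional set and the verified set keeps the two lanes independent, so a model can be category-eligible provisionally without being eligible on the verified leaderboard.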

Display-Only Benchmarks

Some benchmarks (MMLU, BBH, HumanEval, older AIME/HMMT variants) are shown for context but do not affect scoring. They are either saturated (top models all score 97%+) or superseded by harder versions.

200 models match these filters.

Model	Confidence	Prov. score	Sourced	Rankable	Generated	Coverage
Qwen3.7 Plus Alibaba	●●●●High	67.22	51	51	0	100%
Claude Opus 4.5 Anthropic	●●●●High	64.22	43	43	0	100%
Kimi K2.5 Moonshot AI	●●●●High	59.66	40	40	0	100%
Qwen3.6 Plus Alibaba	●●●●High	65.2	39	39	0	100%
GLM-5 Z.AI	●●●●High	66.06	35	35	0	100%
Qwen3.5 397B Alibaba	●●●●High	57.01	35	35	0	100%
Qwen3.7 Max Alibaba	●●●●High	72.84	33	33	0	100%
Gemini 3.5 Flash Google	●●●●High	64.75	23	23	0	100%
Qwen3.6-35B-A3B Alibaba	●●●○Good	51.47	40	40	0	100%
Qwen3.6-27B Alibaba	●●●○Good	53.82	37	37	0	100%
Claude Opus 4.6 Anthropic	●●●○Good	68.59	31	31	0	100%
GPT-5.4 OpenAI	●●●○Good	74.24	31	31	0	100%
GPT-5.5 OpenAI	●●●○Good	73.51	29	29	0	100%
Kimi K2.6 Moonshot AI	●●●○Good	56.79	29	29	0	100%
Claude Opus 4.8 Anthropic	●●●○Good	78.34	28	28	0	100%
Muse Spark Meta	●●●○Good	71.04	23	23	0	100%
GPT-5.6 Sol OpenAI	●●●○Good	81.96	22	22	0	100%
GPT-5.6 Terra OpenAI	●●●○Good	72.57	20	20	0	100%
GPT-5.6 Luna OpenAI	●●●○Good	67.17	20	20	0	100%
Claude Opus 4.7 (Adaptive) Anthropic	●●●○Good	66.27	20	20	0	100%
Gemini 3.1 Pro Google	●●●○Good	55.3	19	19	0	100%
Claude Sonnet 4.6 Anthropic	●●●○Good	65.07	19	19	0	100%
Inkling Thinking Machines Lab	●●●○Good	67.54	16	16	0	100%
GPT-5.2 OpenAI	●●●○Good	58.43	14	14	0	100%
Qwen3.5-122B-A10B Alibaba	●●●○Good	60.56	13	13	0	100%
Qwen3.5-27B Alibaba	●●●○Good	60.7	13	13	0	100%
Qwen3.5-35B-A3B Alibaba	●●●○Good	56.97	13	13	0	100%
GPT-5.4 mini OpenAI	●●●○Good	56.77	12	12	0	100%
DeepSeek V4 Pro (High) DeepSeek	●●○○Moderate	55.47	22	22	0	100%
DeepSeek V4 Flash (High) DeepSeek	●●○○Moderate	53.95	22	22	0	100%
DeepSeek V4 Pro DeepSeek	●●○○Moderate	60.66	20	20	0	100%
DeepSeek V4 Flash DeepSeek	●●○○Moderate	58.88	20	20	0	100%
Muse Spark 1.1 Meta	●●○○Moderate	77.44	19	19	0	100%
GLM-5.1 Z.AI	●●○○Moderate	67.74	18	18	0	100%
MiniMax M3 MiniMax	●●○○Moderate	69.75	16	16	0	100%
Grok 4.20 xAI	●●○○Moderate	54.68	16	16	0	100%
Claude Mythos 5 Anthropic	●●○○Moderate	83.93	15	15	0	100%
Claude Sonnet 5 Anthropic	●●○○Moderate	65.32	15	15	0	100%
GLM-5.2 Z.AI	●●○○Moderate	63.96	15	15	0	100%
Nemotron 3 Nano Omni 30B A3B NVIDIA	●●○○Moderate	44.24	14	14	0	100%
Claude Fable 5 Anthropic	●●○○Moderate	83.68	11	11	0	100%
Gemini 3 Pro Google	●●○○Moderate	67.73	11	11	0	100%
GPT-5.4 nano OpenAI	●●○○Moderate	66.79	11	11	0	100%
GPT-5.4 Pro OpenAI	●●○○Moderate	60.89	11	11	0	100%
Gemma 4 12B Google	●●○○Moderate	47.29	11	11	0	100%
Qwen 3.6 Max (preview) Alibaba	●●○○Moderate	59.72	10	10	0	100%
GLM-4.7 Z.AI	●●○○Moderate	61.16	9	9	0	100%
Claude Sonnet 4.5 Anthropic	●●○○Moderate	53.61	9	9	0	100%
Kimi K2.5 (Reasoning) Moonshot AI	●●○○Moderate	59.35	8	8	0	100%
MiMo-V2.5 Xiaomi	●●○○Moderate	58.62	8	8	0	100%
Kimi K3 Moonshot AI	●○○○Low / Estimated	~80.96	29	7	0	414%
MiniMax M2.7 MiniMax	●○○○Low / Estimated	~64.11	17	17	0	100%
Step 3.7 Flash StepFun	●○○○Low / Estimated	~50.87	10	10	0	100%
GPT-5.3 Codex OpenAI	●○○○Low / Estimated	~66.69	7	7	0	100%
MiMo-V2.5-Pro Xiaomi	●○○○Low / Estimated	~70.19	7	7	0	100%
GPT-5.5 Pro OpenAI	●○○○Low / Estimated	~63.69	6	6	0	100%
Grok 4.5 xAI	●○○○Low / Estimated	~76.72	6	6	0	100%
Claude Opus 4.7 Anthropic	●○○○Low / Estimated	~71.94	6	6	0	100%
DeepSeek V3.2 DeepSeek	●○○○Low / Estimated	~55.4	6	6	0	100%
Grok 4.3 xAI	●○○○Low / Estimated	~65.1	6	6	0	100%
LFM2.5-8B-A1B LiquidAI	●○○○Low / Estimated	~41.42	6	6	0	100%
Gemma 4 31B Google	●○○○Low / Estimated	~61.08	6	6	0	100%
GPT-4.1 OpenAI	●○○○Low / Estimated	~51.11	6	6	0	100%
Gemini 3 Flash Google	●○○○Low / Estimated	~60.49	5	5	0	100%
o3-mini OpenAI	●○○○Low / Estimated	~47.41	5	5	0	100%
GPT-4.1 mini OpenAI	●○○○Low / Estimated	~44.19	5	5	0	100%
Gemini 2.5 Pro Google	●○○○Low / Estimated	~57.25	5	5	0	100%
o1 OpenAI	●○○○Low / Estimated	~48.1	4	4	0	100%
Claude Haiku 4.5 Anthropic	●○○○Low / Estimated	~56.58	4	4	0	100%
Qwen3 235B 2507 Alibaba	●○○○Low / Estimated	~56.02	4	4	0	100%
Qwen3.5 Plus Alibaba	●○○○Low / Estimated	~47.2	4	4	0	100%
Gemma 4 26B A4B Google	●○○○Low / Estimated	~57.96	4	4	0	100%
GPT-4.1 nano OpenAI	●○○○Low / Estimated	~42.06	4	4	0	100%
Trinity-Large-Preview Arcee AI	●○○○Low / Estimated	~55.94	4	4	0	100%
Trinity-Large-Thinking Arcee AI	●○○○Low / Estimated	~52.32	4	4	0	100%
GPT-5.1 OpenAI	●○○○Low / Estimated	~53.65	3	3	0	100%
Grok 4 xAI	●○○○Low / Estimated	~60.42	3	3	0	100%
GLM-4.6 Z.AI	●○○○Low / Estimated	~55.12	3	3	0	100%
Ling 2.6 Flash InclusionAI	●○○○Low / Estimated	~43.87	3	3	0	100%
Command A+ Cohere	●○○○Low / Estimated	~47.51	3	3	0	100%
Exaone 4.0 32B LG AI Research	●○○○Low / Estimated	~40.44	2	2	0	100%
Claude 4.1 Opus Anthropic	●○○○Low / Estimated	~45.88	2	2	0	100%
Claude 4 Sonnet Anthropic	●○○○Low / Estimated	~42.79	2	2	0	100%
Gemini 3.1 Flash-Lite Google	●○○○Low / Estimated	~50.83	2	2	0	100%
o4-mini (high) OpenAI	●○○○Low / Estimated	~49.98	2	2	0	100%
Kimi K2 Moonshot AI	●○○○Low / Estimated	~27.19	2	2	0	100%
o3 OpenAI	●○○○Low / Estimated	~47.89	2	2	0	100%
Qwen3 235B 2507 (Reasoning) Alibaba	●○○○Low / Estimated	~58.01	2	2	0	100%
Gemini 2.5 Flash Google	●○○○Low / Estimated	~48.09	2	2	0	100%
Qwen3.5 Flash Alibaba	●○○○Low / Estimated	~47.65	2	2	0	100%
Grok 3 [Beta] xAI	●○○○Low / Estimated	~40.43	2	2	0	100%
Claude 3.5 Sonnet Anthropic	●○○○Low / Estimated	~47.74	2	2	0	100%
GPT-5 (high) OpenAI	●○○○Low / Estimated	~58.61	2	2	0	100%
GPT-5.2-Codex OpenAI	●○○○Low / Estimated	~59.1	2	2	0	100%
Kimi K2.7 Code Moonshot AI	●○○○Low / Estimated	~55	2	2	0	100%
GPT-5.1-Codex OpenAI	●○○○Low / Estimated	~52.72	2	2	0	100%
MiMo-V2-Flash Xiaomi	●○○○Low / Estimated	~54.06	1	1	0	100%
Gemini 3 Pro Deep Think Google	●○○○Low / Estimated	~61.31	1	1	0	100%
DeepSeek V3 DeepSeek	●○○○Low / Estimated	~44.97	1	1	0	100%
GPT-4o OpenAI	●○○○Low / Estimated	~41.49	1	1	0	100%
Llama 4 Scout Meta	●○○○Low / Estimated	~39.87	1	1	0	100%
Llama 4 Maverick Meta	●○○○Low / Estimated	~23.49	1	1	0	100%
Claude Opus 4.6 (Adaptive) Anthropic	●○○○Low / Estimated	~64.18	1	1	0	100%
GLM-5 (Reasoning) Z.AI	●○○○Low / Estimated	~59.77	1	1	0	100%
GLM-5-Turbo Z.AI	●○○○Low / Estimated	~66.89	1	1	0	100%
MiMo-V2-Pro Xiaomi	●○○○Low / Estimated	~67.78	1	1	0	100%
GLM-5V-Turbo Z.AI	●○○○Low / Estimated	~63.5	1	1	0	100%
Grok 4.1 Fast (Reasoning) xAI	●○○○Low / Estimated	~60.51	1	1	0	100%
MiMo-V2-Omni Xiaomi	●○○○Low / Estimated	~63.15	1	1	0	100%
DeepSeek V3.2 (Thinking) DeepSeek	●○○○Low / Estimated	~58.15	1	1	0	100%
Grok 4 Fast (Reasoning) xAI	●○○○Low / Estimated	~56.59	1	1	0	100%
MiniMax M2.5 MiniMax	●○○○Low / Estimated	~59.52	1	1	48	2%
GPT-OSS 120B OpenAI	●○○○Low / Estimated	~50.08	1	1	0	100%
GPT-5.1-Codex-Max OpenAI	●○○○Low / Estimated	~54.48	1	1	0	100%
GPT-OSS 20B OpenAI	●○○○Low / Estimated	~42.74	1	1	0	100%
Nemotron 3 Super 100B NVIDIA	●○○○Low / Estimated	~50.08	1	1	0	100%
GPT-5 mini OpenAI	●○○○Low / Estimated	~43.93	1	1	60	2%
Qwen3 Max Alibaba	●○○○Low / Estimated	~48.16	1	1	0	100%
Claude Opus 4.5 Thinking Anthropic	●○○○Low / Estimated	~57.44	1	1	0	100%
Grok 4.1 xAI	●○○○Low / Estimated	~59.97	0	0	0	0%
Qwen3.5 397B (Reasoning) Alibaba	●○○○Low / Estimated	~59.5	0	0	0	0%
GPT-5.2 Instant OpenAI	●○○○Low / Estimated	~59	0	0	32	0%
GPT-5.2 Pro OpenAI	●○○○Low / Estimated	~67.01	0	0	32	0%
GPT-5.3 Instant OpenAI	●○○○Low / Estimated	~58.9	0	0	35	0%
Grok 4.1 Fast xAI	●○○○Low / Estimated	~51.25	0	0	0	0%
DeepSeek V3.1 (Reasoning) DeepSeek	●○○○Low / Estimated	~53.43	0	0	0	0%
DeepSeek V3.1 DeepSeek	●○○○Low / Estimated	~53.64	0	0	0	0%
Mistral Large 3 Mistral	●○○○Low / Estimated	~50.4	0	0	0	0%
GLM-4.5 Z.AI	●○○○Low / Estimated	~57.56	0	0	0	0%
DeepSeek-R1 DeepSeek	●○○○Low / Estimated	~51.67	0	0	0	0%
GPT-5.3-Codex-Spark OpenAI	●○○○Low / Estimated	~56.91	0	0	44	0%
Step 3.5 Flash StepFun	●○○○Low / Estimated	~55.1	0	0	50	0%
o1-preview OpenAI	●○○○Low / Estimated	~49.11	0	0	0	0%
GLM-4.5-Air Z.AI	●○○○Low / Estimated	~47.7	0	0	0	0%
GLM-4.7-Flash Z.AI	●○○○Low / Estimated	~51.25	0	0	45	0%
Gemma 3 27B Google	●○○○Low / Estimated	~41.57	0	0	0	0%
Mistral Small 4 Mistral	●○○○Low / Estimated	~46.23	0	0	0	0%
DeepSeek LLM 2.0 DeepSeek	●○○○Low / Estimated	~54.53	0	0	0	0%
Mercury 2 Inception	●○○○Low / Estimated	~51.28	0	0	48	0%
GPT-5 (medium) OpenAI	●○○○Low / Estimated	~55.15	0	0	0	0%
Claude 3 Opus Anthropic	●○○○Low / Estimated	~41.13	0	0	0	0%
Nemotron 3 Nano 30B NVIDIA	●○○○Low / Estimated	~52.94	0	0	0	0%
GPT-4o mini OpenAI	●○○○Low / Estimated	~37.87	0	0	0	0%
Mistral Large 2 Mistral	●○○○Low / Estimated	~41.77	0	0	0	0%
Qwen2.5-72B Alibaba	●○○○Low / Estimated	~52.15	0	0	0	0%
Llama 3.1 405B Meta	●○○○Low / Estimated	~51.71	0	0	0	0%
Nemotron 3 Super 120B A12B NVIDIA	●○○○Low / Estimated	~50.97	0	0	49	0%
Llama 3 70B Meta	●○○○Low / Estimated	~50.82	0	0	0	0%
Qwen2.5 Coder 32B Instruct Alibaba	●○○○Low / Estimated	~34.7	0	0	0	0%
DeepSeek Coder 2.0 DeepSeek	●○○○Low / Estimated	~50.28	0	0	0	0%
Seed 1.6 ByteDance	●○○○Low / Estimated	~50.23	0	0	32	0%
Gemini 1.5 Pro Google	●○○○Low / Estimated	~35.71	0	0	0	0%
Phi-4 Microsoft	●○○○Low / Estimated	~22.69	0	0	0	0%
Qwen2.5-1M Alibaba	●○○○Low / Estimated	~49.89	0	0	0	0%
DeepSeekMath V2 DeepSeek	●○○○Low / Estimated	~49.89	0	0	0	0%
Seed-2.0-Lite ByteDance	●○○○Low / Estimated	~49.84	0	0	32	0%
Ministral 3 14B (Reasoning) Mistral	●○○○Low / Estimated	~49.34	0	0	48	0%
o3-pro OpenAI	●○○○Low / Estimated	~48.33	0	0	0	0%
Ministral 3 14B Mistral	●○○○Low / Estimated	~34.47	0	0	48	0%
Aion-2.0 Aion Labs	●○○○Low / Estimated	~48.66	0	0	32	0%
Mixtral 8x22B Instruct v0.1 Mistral	●○○○Low / Estimated	~48.54	0	0	0	0%
Grok Code Fast 1 xAI	●○○○Low / Estimated	~38.64	0	0	0	0%
GPT-4 Turbo OpenAI	●○○○Low / Estimated	~27.44	0	0	0	0%
Z-1 Z	●○○○Low / Estimated	~45.13	0	0	0	0%
Seed 1.6 Flash ByteDance	●○○○Low / Estimated	~45.08	0	0	32	0%
Nemotron-4 15B NVIDIA	●○○○Low / Estimated	~45.08	0	0	0	0%
Mistral 8x7B Mistral	●○○○Low / Estimated	~44.99	0	0	0	0%
Moonshot v1 Moonshot AI	●○○○Low / Estimated	~44.79	0	0	0	0%
Seed-2.0-Mini ByteDance	●○○○Low / Estimated	~44.6	0	0	32	0%
Nemotron Ultra 253B NVIDIA	●○○○Low / Estimated	~44.4	0	0	0	0%
Gemini 1.0 Pro Google	●○○○Low / Estimated	~21.78	0	0	0	0%
Claude 3 Haiku Anthropic	●○○○Low / Estimated	~21.39	0	0	0	0%
LFM2-24B-A2B LiquidAI	●○○○Low / Estimated	~18.87	0	0	43	0%
Claude 4.1 Opus Thinking Anthropic	●○○○Low / Estimated	~36.55	0	0	0	0%
Ministral 3 8B (Reasoning) Mistral	●○○○Low / Estimated	~40.38	0	0	48	0%
Nova Pro Amazon	●○○○Low / Estimated	~20.32	0	0	0	0%
Ministral 3 8B Mistral	●○○○Low / Estimated	~20.96	0	0	48	0%
Mistral 7B v0.3 Mistral	●○○○Low / Estimated	~39.9	0	0	0	0%
Qwen2.5-VL-32B Alibaba	●○○○Low / Estimated	~39.85	0	0	0	0%
Llama 4 Behemoth Meta	●○○○Low / Estimated	~39.8	0	0	0	0%
LFM2.5-1.2B-Thinking LiquidAI	●○○○Low / Estimated	~16.23	0	0	43	0%
Ministral 3 3B (Reasoning) Mistral	●○○○Low / Estimated	~39.52	0	0	48	0%
MiniMax M1 80k MiniMax	●○○○Low / Estimated	~25.12	0	0	43	0%
Ministral 3 3B Mistral	●○○○Low / Estimated	~18.27	0	0	48	0%
Mistral 8x7B v0.2 Mistral	●○○○Low / Estimated	~39.13	0	0	0	0%
LFM2.5-1.2B-Instruct LiquidAI	●○○○Low / Estimated	~15.52	0	0	43	0%
Hy3 Tencent	●○○○Low / Estimated	~55.64	0	0	0	0%
Gemma 4 E2B Google	●○○○Low / Estimated	~41.82	0	0	0	0%
Gemma 4 E4B Google	●○○○Low / Estimated	~43.2	0	0	0	0%
Sarvam 105B Sarvam	●○○○Low / Estimated	~42.97	0	0	0	0%
Sarvam 30B Sarvam	●○○○Low / Estimated	~40.7	0	0	0	0%
Mistral Medium 3 Mistral	●○○○Low / Estimated	~43.2	0	0	0	0%
Granite-4.0-350M IBM	●○○○Low / Estimated	~38.32	0	0	0	0%
Granite-4.0-H-350M IBM	●○○○Low / Estimated	~38.32	0	0	0	0%
GPT-5 nano OpenAI	●○○○Low / Estimated	~46.36	0	0	28	0%
DeepSeek R1 Distill Qwen 32B DeepSeek	●○○○Low / Estimated	~42.58	0	0	0	0%
o1-pro OpenAI	●○○○Low / Estimated	~45.94	0	0	0	0%
Exaone 4.0 1.2B LG AI Research	●○○○Low / Estimated	~39.06	0	0	0	0%
K-Exaone LG AI Research	●○○○Low / Estimated	~48.45	0	0	0	0%
Solar Pro 2 Upstage	●○○○Low / Estimated	~41.19	0	0	0	0%

Sourced = exact-source benchmark coverage. Rankable = non-generated benchmark coverage used by the provisional leaderboard. Generated = inferred from related models and excluded from ranking. Coverage = sourced share of the visible benchmark footprint.
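
The legend does not spell out what the "visible benchmark footprint" contains. The sketch below assumes one plausible reading, namely that the footprint is the rankable rows plus the generated rows; that reading, and the function name `coverage`, are assumptions rather than BenchLM's documented formula. It is consistent with the Qwen3.7 Plus row in the table (51 sourced, 51 rankable, 0 generated, 100%).

```python
from typing import Optional


def coverage(sourced: int, rankable: int, generated: int) -> Optional[float]:
    """Sourced share of the footprint, as a fraction between 0 and 1.

    Assumption: footprint = rankable + generated. Returns None when the
    footprint is empty, because the share is undefined in that case.
    """
    footprint = rankable + generated
    if footprint == 0:
        return None
    return sourced / footprint


# Counts taken from rows on this page.
print(coverage(51, 51, 0))  # Qwen3.7 Plus
print(coverage(0, 0, 43))   # LFM2-24B-A2B
print(coverage(0, 0, 0))    # Nemotron-4 15B
```

Under this reading, a model with only generated rows has zero coverage, while a model with no rows at all has no defined coverage.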

Frequently Asked Questions

What is benchmark confidence on BenchLM?

Score confidence (1-4 dots) indicates how much sourced benchmark data supports a model's score. A 4-dot score has at least 20 sourced rows across seven or more categories. A 1-dot score has limited coverage. Some benchmark-sparse models can appear provisionally when several independent public evaluation families agree.
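
As a minimal sketch, the dot rating can be read as a threshold lookup. The thresholds below come from the tier legend at the top of this page (7+ categories and 20+ benchmarks, 5+ and 12+, 3+ and 8+); the function names and the choice of which benchmark count to pass in are assumptions, not BenchLM's implementation.

```python
# (minimum categories, minimum benchmarks, dots), checked from strictest to loosest.
TIERS = [
    (7, 20, 4),  # High
    (5, 12, 3),  # Good
    (3, 8, 2),   # Moderate
]


def confidence_dots(categories: int, benchmarks: int) -> int:
    """Return 1-4 dots; both thresholds of a tier must be met."""
    for min_categories, min_benchmarks, dots in TIERS:
        if categories >= min_categories and benchmarks >= min_benchmarks:
            return dots
    return 1  # Low / Estimated


def render_dots(dots: int) -> str:
    """Draw the indicator the way the page shows it, e.g. two filled of four."""
    return "●" * dots + "○" * (4 - dots)


print(render_dots(confidence_dots(categories=7, benchmarks=20)))
print(render_dots(confidence_dots(categories=2, benchmarks=50)))
```

Because both thresholds must hold, a model with many benchmarks concentrated in two categories still lands in the lowest tier.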

What does "estimated" mean on BenchLM scores?

Scores marked with "Est." or "~" have limited source-backed benchmark coverage or qualify through corroborating external evaluation families. Unresolved manual values and generated rows cannot score. The verified leaderboard requires sourced benchmark support; external consensus alone never upgrades an estimated row to verified status.

How does BenchLM detect contamination risk?

BenchLM tracks two key signals: (1) benchmark provenance, meaning whether each score is a hand-entered public row ("manual") or was generated (inferred from related models), and (2) benchmark freshness, since older benchmarks that haven't been updated are more likely to have leaked into training data. Models with mostly generated data or stale benchmarks receive lower confidence ratings. Exact-source verification is tracked separately from this manual-vs-generated split.

What is benchmark provenance?

Provenance records where each benchmark value came from and whether the exact model, benchmark variant, harness, and score can be supported. Source-backed rows can score. Unresolved manual values and generated estimates remain in the audit trail but do not affect either public ranking view.
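
One way to picture this rule is a row type that carries a provenance tag, with scoring restricted to source-backed rows. The `Provenance` enum, the `Row` fields, the model name `example-model`, and the score values are all hypothetical and chosen for illustration; they are not BenchLM's schema or data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Provenance(Enum):
    SOURCED = "sourced"        # model, variant, harness, and score are supported
    UNRESOLVED = "unresolved"  # manual value without a resolved source
    GENERATED = "generated"    # estimate inferred from related models


@dataclass(frozen=True)
class Row:
    model: str
    benchmark: str
    score: float
    provenance: Provenance


def scoring_rows(rows: List[Row]) -> List[Row]:
    """Keep only the source-backed rows, the ones allowed to score."""
    return [r for r in rows if r.provenance is Provenance.SOURCED]


# Illustrative rows only; every row stays in the audit trail.
audit_trail = [
    Row("example-model", "HLE", 30.0, Provenance.SOURCED),
    Row("example-model", "Terminal-Bench 2.0", 40.0, Provenance.UNRESOLVED),
    Row("example-model", "MMLU", 90.0, Provenance.GENERATED),
]
print([r.benchmark for r in scoring_rows(audit_trail)])
```

Filtering at read time, rather than deleting rows, keeps the full audit trail available while the ranking views see only what is allowed to score.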

Which LLM benchmarks are most reliable?

Fresh, held-out benchmarks like SWE-Rebench (rolling window), Terminal-Bench 2.0, and HLE are the hardest to game. Older, saturated benchmarks like MMLU (where top models all score 97-99%) provide little signal. BenchLM weights newer, harder benchmarks more heavily and flags saturated ones as display-only.
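
A saturation check along these lines could be as simple as testing whether every top score clears a cutoff. The 97% cutoff follows the figure quoted on this page; how many models count as "top", the function name, and the sample scores are assumptions made for illustration.

```python
from typing import Sequence


def is_saturated(top_scores: Sequence[float], threshold: float = 97.0) -> bool:
    """True when every top-model score is at or above the threshold.

    An empty list is treated as not saturated, since there is no evidence.
    """
    if not top_scores:
        return False
    return min(top_scores) >= threshold


# Illustrative score lists, not real leaderboard values.
print(is_saturated([97.5, 98.2, 99.0]))  # all above the cutoff
print(is_saturated([97.5, 88.0, 99.0]))  # one model still has headroom
```

A benchmark flagged this way would be shown for context only, since differences among the top models are too small to carry ranking signal.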
