Knowledge

Knowledge Benchmarks — GPQA, HLE & FrontierScience Leaderboard

General knowledge and factual understanding

Bottom line: Broad knowledge (MMLU) is saturated. The real differentiators are HLE and FrontierScience — frontier-difficulty tests where top models still score below 30%.

MMLU · GPQA · GPQA-D · SuperGPQA · MMLU-Pro · HLE · FrontierScience · HLE w/o tools · SimpleQA · HealthBench Hard · MedXpertQA (Text) · FrontierScience Research · MMLU-Pro (Arcee) · MMMLU

Frontier science · Broad academic knowledge · Factuality

Best Knowledge picks

BenchLM summaries for knowledge plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Top AI Models for Knowledge, June 2026

As of June 2026, Claude Opus 4.7 (Adaptive) leads the provisional knowledge leaderboard with a weighted score of 99.1%, followed by GPT-5.4 (99.0%) and Claude Opus 4.8 (98.8%). BenchLM is currently showing 104 provisional-ranked models and 24 verified-ranked models in this category.

What changed

Claude Mythos Preview leads knowledge with the strongest HLE and FrontierScience scores.

GPT-5.4 is a close second, with excellent GPQA Diamond scores.

Claude Opus 4.6 holds #3 and is strong on SuperGPQA and SimpleQA factual accuracy.

How to choose

Top models by benchmark

Expert-level questions in biology, physics, and chemistry (12% of category score)

Knowledge Leaderboard

Updated June 2, 2026

Sorted by knowledge weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

104 ranked models
Provisional-ranked mode includes source-unverified non-generated benchmark evidence. P = provisional benchmark row.
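Rank | Model | Provider | Tag | Weighted score | Second score (label not captured) | Benchmark scores (column labels not captured, source order)
1 | Claude Opus 4.7 (Adaptive) | (not captured) | - | 99.1% | 85 | 94.2%, 94.2%, 54.7%, 46.9%
2 | GPT-5.4 | OpenAI | - | 99% | 89 | 92.8%, 92.8%, 52.1%, 39.8%, 40.1%, 59.6%
3 | Claude Opus 4.8 | (not captured) | - | 98.8% | 95 | 93.6%, 93.6%, 57.9%, 49.8%
4 | (not captured) | (not captured) | - | 94.5% | 92 | 94.3%, 45.4%, 20.6%, 71.5%
5 | (not captured) | (not captured) | - | 94.3% | Est. 90 | -
6 | (not captured) | (not captured) | - | 92.8% | Est. 86 | -
7 | GPT-5.2 | OpenAI | - | 91.5% | 79 | 92.4%
8 | (not captured) | (not captured) | - | 90.7% | 87 | 91.3%, 89.2%, 95%, 82%, 53%, 40%, 14.8%, 52.1%, 89.1%
9 | (not captured) | (not captured) | - | 87.1% | Est. 90 | -
10 | (not captured) | (not captured) | - | 86.5% | 91 | 92.4%, 92.4%, 73.6%, 89.6%, 41.4%, 90.3%
11 | (not captured) | (not captured) | - | 84.3% | 82 | 86.2%, 52.3%
12 | (not captured) | (not captured) | - | 83.2% | 76 | 87%, 70.6%, 89.5%, 30.8%
13 | GLM-5 | Z.AI | Self-host | 82.7% | 67 | 86%, 86.0%, 66.8%, 85.7%, 50.4%, 85.8%
14 | (not captured) | (not captured) | - | 82.6% | 81 | -
15 | (not captured) | (not captured) | - | 82.4% | 83 | 89.9%, 95%, 79.2%, 49%
16 | GLM-5 (Reasoning) | Z.AI | Self-host | 81.8% | Est. 80 | -
17 | GPT-5.1 | OpenAI | - | 81.8% | Est. 78 | -
18 | (not captured) | (not captured) | - | 80.1% | Est. 83 | -
19 | (not captured) | (not captured) | - | 80% | 64 | 86.6%, 67.1%, 86.7%
20 | (not captured) | (not captured) | - | 79.2% | Est. 77 | -
21 | (not captured) | (not captured) | - | 78.5% | Est. 75 | -
22 | Qwen3.5 397B (Reasoning) | Alibaba | Self-host | 78.3% | Est. 78 | -
23 | (not captured) | (not captured) | - | 77.8% | Est. 76 | -
24 | (not captured) | (not captured) | - | 77.8% | 62 | 85.5%, 65.6%, 86.1%
25 | (not captured) | (not captured) | - | 77.1% | 87 | 90.1%, 90.1%, 87.5%, 37.7%, 57.9%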
Showing 25 of 104

These rankings update weekly


Score in Context

What these scores mean

Knowledge carries a 12% weight in overall scoring. The weighted score blends expert-level tests (HLE, FrontierScience, GPQA) with broad knowledge (MMLU-Pro). A model scoring 90+ on MMLU might still struggle with research-level scientific reasoning — broad knowledge doesn't guarantee deep expertise.

Known limitations

MMLU is saturated — top models score 90%+, making it poor at differentiating. HLE ("Humanity's Last Exam") is deliberately very hard, so even top models score below 30%. Small score differences on HLE are noisy. SimpleQA measures factual accuracy but can penalize models that hedge appropriately.
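To get a feel for why small HLE gaps are noisy, the sketch below computes a binomial standard error for an accuracy score and checks whether the gap between two models falls inside it. The question count of 2500 used in the usage note is an illustrative assumption, not a figure from this page, and the helper names `accuracy_standard_error` and `gap_is_within_noise` are hypothetical. It also treats the two scores as independent samples, which is a simplification.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy (0..1) measured on n_questions items."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_is_within_noise(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> bool:
    """True when the gap between two accuracies is smaller than z standard errors.

    Assumption: both models were scored on the same number of questions and
    their errors are treated as independent.
    """
    se_a = accuracy_standard_error(acc_a, n_questions)
    se_b = accuracy_standard_error(acc_b, n_questions)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) < z * se_diff
```

Under these assumptions, `gap_is_within_noise(0.30, 0.31, 2500)` is true, so a one-point HLE lead on a test of that size would not separate two models on its own, while a ten-point lead would.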

How we weight

Knowledge carries a 12% weight in BenchLM.ai's overall scoring.

For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See the knowledge leaderboard for the top models in this category.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
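The exclusion and fallback rule above can be sketched as follows. The page does not spell out the arithmetic, so the renormalized weighted mean here is an assumption about what "falls back to the remaining trustworthy public rows" could mean, and it is not expected to reproduce the displayed leaderboard scores. The `BenchmarkRow` type, its `generated` and `cloned` flags, and `category_score` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class BenchmarkRow:
    benchmark: str
    score: float             # percentage, 0-100
    generated: bool = False  # hypothetical flag: derived from other scores
    cloned: bool = False     # hypothetical flag: copied from a reference model


def category_score(rows: Iterable[BenchmarkRow], weights: Dict[str, float]) -> Optional[float]:
    """Weighted mean over trustworthy rows only.

    Assumption: when a weighted benchmark is missing after filtering, the
    remaining weights are renormalized rather than the gap being filled.
    """
    trusted = {r.benchmark: r.score for r in rows if not (r.generated or r.cloned)}
    usable = {name: w for name, w in weights.items() if name in trusted}
    total_weight = sum(usable.values())
    if total_weight == 0:
        return None  # nothing trustworthy to score
    return sum(trusted[name] * w for name, w in usable.items()) / total_weight
```

With weights of 23 for HLE and 22 for MMLU-Pro, a model whose MMLU-Pro row is flagged as generated would be scored on its HLE row alone in this sketch, instead of having a synthetic MMLU-Pro value averaged in.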

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
MMLU | - | Display only | Tests knowledge across 57 academic subjects
GPQA | 12% | Weighted | Expert-level questions in biology, physics, and chemistry
GPQA-D | - | Display only | Provider-table reference for GPQA Diamond scores reported in first-party comparison charts.
SuperGPQA | 12% | Weighted | Enhanced version covering 285 disciplines
MMLU-Pro | 22% | Weighted | Harder version of MMLU with 10 answer choices and more reasoning-focused questions
HLE | 23% | Weighted | Extremely difficult questions contributed by domain experts worldwide to test frontier AI
FrontierScience | 18% | Weighted | Research-level science and scientific reasoning benchmark
HLE w/o tools | - | Display only | Tool-free variant of Humanity's Last Exam used to isolate raw frontier reasoning without external aids
SimpleQA | 13% | Weighted | Factual question answering benchmark
HealthBench Hard | - | Display only | A harder health reasoning benchmark subset used in first-party frontier model comparisons.
MedXpertQA (Text) | - | Display only | Medical multiple-choice benchmark covering many specialties with text-only questions.
FrontierScience Research | - | Display only | A research-oriented FrontierScience variant focused on scientific investigation and solution quality.
MMLU-Pro (Arcee) | - | Display only | Display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.
MMMLU | - | Display only | A multilingual MMLU-style benchmark reported in provider evaluation tables.
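As a rough illustration of how the six weighted benchmarks in the table combine, the sketch below stores the listed weights, confirms they sum to 100%, and breaks a score down into per-benchmark contributions. The weights come from the table; the function name `weighted_contributions` and the plain weighted sum are assumptions, since the page may normalize scores before blending them.

```python
from typing import Dict

# Weights as listed in the table (percent of the knowledge category score).
KNOWLEDGE_WEIGHTS: Dict[str, int] = {
    "GPQA": 12,
    "SuperGPQA": 12,
    "MMLU-Pro": 22,
    "HLE": 23,
    "FrontierScience": 18,
    "SimpleQA": 13,
}


def weighted_contributions(scores: Dict[str, float]) -> Dict[str, float]:
    """Points each benchmark adds to a plain weighted sum (scores are 0-100).

    Assumption: raw percentages are blended directly, with no normalization.
    """
    missing = sorted(set(KNOWLEDGE_WEIGHTS) - set(scores))
    if missing:
        raise KeyError(f"missing scores for: {', '.join(missing)}")
    return {name: scores[name] * weight / 100 for name, weight in KNOWLEDGE_WEIGHTS.items()}
```

Summing the returned values gives the blended score. Because HLE and MMLU-Pro together carry 45% of the weight, a gap on those two benchmarks moves the result more than the same gap on GPQA or SuperGPQA.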


About Knowledge Benchmarks

Tests knowledge across 57 academic subjects
