Knowledge benchmark report

Best LLMs for Knowledge — July 2026 Leaderboard

Name: Knowledge Benchmarks — LLM Leaderboard
Creator: BenchLM.ai

As of July 2026, the top knowledge model on the BenchLM leaderboard is Muse Spark 1.1 with a weighted knowledge score of 93.5.

Data refreshed: July 23, 2026

General knowledge and factual understanding

Decision lens: use provisional-ranked mode for broader public evidence and verified-ranked mode for source-only comparisons. A model can move between views as evidence coverage changes.

Provisional-ranked: 52 of 290 models
Verified-ranked: 27 of 290 models
Weighted evidence: 5 of 15 benchmarks

15 tracked benchmarks

MMLU, GPQA, GPQA-D, SuperGPQA, MMLU-Pro, HLE, FrontierScience, HLE w/o tools, SimpleQA, HealthBench Hard, HealthBench Professional, MedXpertQA (Text), FrontierScience Research, MMLU-Pro (Arcee), MMMLU

Scope: Frontier science, Broad academic knowledge, Factuality


Best Knowledge picks

BenchLM summaries for knowledge plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.


Best Knowledge: Muse Spark 1.1 (Meta), 93.5 category score

Best open weight: GLM-5.2 (Z.AI), 73 overall score

Fastest measured: GPT-5.4 mini (OpenAI), 201 tokens / sec

Largest useful context: Claude Mythos 5 (Anthropic), 1M+ context window

Knowledge Leaderboard

Primary score: weighted knowledge score. Higher values rank first.

Provisional-ranked mode includes source-unverified, non-generated benchmark evidence. P = provisional benchmark row.


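Rank / model / creator	License	Type	Context	Price	Speed (tokens / sec)	Latency	Weighted knowledge	Overall score	MMLU	GPQA	GPQA-D	SuperGPQA	MMLU-Pro	HLE	FrontierScience	HLE w/o tools	SimpleQA	HealthBench Hard	HealthBench Professional	MedXpertQA (Text)	FrontierScience Research	MMLU-Pro (Arcee)	MMMLU
1 Muse Spark 1.1 Meta	Closed	Reasoning	1M	Not listed	Not listed	Not listed	93.5%	74	—	—	—	—	—	62.1%	—	52.2%	—	—	59.3%	—	—	—	—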
2 Claude Opus 4.8 Anthropic	Closed	Reasoning	1M	$5.00 / $25.00	Not listed	Not listed	92.5%	82	—	93.6%	93.6%	—	—	57.9%	—	49.8%	—	—	—	—	—	—	—
3 Claude Sonnet 5 Anthropic	Closed	Reasoning	1M	$2.00 / $10.00	Not listed	Not listed	91.2%	79	—	—	—	—	—	57.4%	—	43.2%	—	—	—	—	—	—	—
4 Kimi K3 Moonshot AI	Closed	Reasoning	1.05M	$3.00 / $15.00	Not listed	Not listed	89.5%	Est.77	—	93.5%	93.5%	—	—	56%	—	43.5%	—	—	—	—	—	—	—
5 Claude Opus 4.7 (Adaptive) Anthropic	Closed	Reasoning	1M	$5.00 / $25.00	Not listed	Not listed	87.5%	73	—	94.2%	94.2%	—	—	54.7%	—	46.9%	—	—	—	—	—	—	—
6 GLM-5.2 Z.AI Self-host	Open	Reasoning	1M	$1.40 / $4.40	Not listed	Not listed	87%	73	—	91.2%	91.2%	—	—	54.7%	—	40.5%	—	—	—	—	—	—	—
7 Claude Opus 4.6 Anthropic	Closed	Standard	1M	$5.00 / $25.00	40	1.78s	85.8%	69	99%P	91.3%	89.2%	95%	82%	53%	88%P	40%	72%P	14.8%	—	52.1%	—	89.1%	—
8 GPT-5.5 OpenAI	Closed	Reasoning	1M	$5.00 / $30.00	Not listed	Not listed	83.5%	74	—	93.6%	93.6%	—	—	52.2%	—	41.4%	—	—	—	—	—	—	—
9 GPT-5.6 Sol OpenAI	Closed	Reasoning	1M	$5.00 / $30.00	Not listed	Not listed	83.2%	80	—	94.6%	94.6%	—	—	—	—	—	—	33.1%	60.5%	—	—	—	—
10 GPT-5.4 OpenAI	Closed	Reasoning	1.05M	$2.50 / $15.00	74	151.79s	82%	67	99%P	92.8%	92.8%	96%P	93%P	52.1%	91%P	39.8%	97%P	40.1%	48.1%	59.6%	—	—	—
11 GPT-5.6 Terra OpenAI	Closed	Reasoning	1M	$2.50 / $15.00	Not listed	Not listed	81.5%	77	—	92.9%	92.9%	—	—	—	—	—	—	32.7%	57.7%	—	—	—	—
12 GPT-5.2 OpenAI	Closed	Reasoning	400K	$1.75 / $14.00	73	130.34s	81%	55	99%P	92.4%	—	95%P	88%P	42%P	91%P	—	95%P	—	—	—	—	—	—
13 GPT-5.6 Luna OpenAI	Closed	Reasoning	1M	$1.00 / $6.00	Not listed	Not listed	80.9%	73	—	92.3%	92.3%	—	—	—	—	—	—	32.0%	55.7%	—	—	—	—
14 GLM-5 Z.AI Self-host	Open	Standard	200K	$1.00 / $3.20	74	1.64s	80.8%	64	91.7%P	86%	86.0%	66.8%	85.7%	50.4%	74%P	—	84%P	—	—	—	—	85.8%	—
15 Claude Sonnet 4.6 Anthropic	Closed	Standard	200K	$3.00 / $15.00	44	1.48s	79.7%	64	99%P	89.9%	—	95%	79.2%	49%	85%P	—	48.5%P	—	—	—	—	—	—
16 Qwen3.5-122B-A10B Alibaba Self-host	Open	Reasoning	262K	$0.00 / $0.00	Not listed	Not listed	77.4%	55	—	86.6%	—	67.1%	86.7%	—	—	—	—	—	—	—	—	—	—
17 Qwen3.5-27B Alibaba Self-host	Open	Reasoning	262K	$0.00 / $0.00	Not listed	Not listed	76.5%	54	—	85.5%	—	65.6%	86.1%	—	—	—	—	—	—	—	—	—	—
18 Qwen3.5-35B-A3B Alibaba Self-host	Open	Reasoning	262K	$0.00 / $0.00	Not listed	Not listed	75.3%	49	—	84.2%	—	63.4%	85.3%	—	—	—	—	—	—	—	—	—	—
19 MiMo-V2.5-Pro Xiaomi	Closed	Reasoning	1M	Not listed	Not listed	Not listed	74.8%	Est.60	—	—	—	—	—	48%	—	34%	—	—	—	—	—	—	—
20 Qwen3.7 Max Alibaba	Closed	Reasoning	1M	Not listed	Not listed	Not listed	73.8%	78	—	92.4%	92.4%	73.6%	89.6%	41.4%	—	—	—	—	—	—	—	—	90.3%
21 DeepSeek V4 Pro (Max) DeepSeek Self-host	Open	Reasoning	1M	$0.43 / $0.87	Not listed	Not listed	73.5%	67	—	90.1%	90.1%	—	87.5%	37.7%	—	—	57.9%	—	—	—	—	—	—
22 Gemini 3.1 Pro Google	Closed	Standard	1M	$2.00 / $12.00	109	29.71s	73%	73	—	97%P	94.3%	95%P	92%P	40%P	88%P	45.4%	95%P	20.6%	—	71.5%	—	—	—
23 Claude Fable 5 Anthropic	Closed	Reasoning	1M+	$10.00 / $50.00	Not listed	Not listed	72.9%	80	—	94.5%P	—	—	—	64.5%P	—	59%P	—	—	—	—	—	—	—
24 Inkling Thinking Machines Lab Self-host	Open	Standard	1M	$1.87 / $4.68	Not listed	Not listed	72.5%	62	—	87.9%	87.9%	—	—	46%	—	30%	—	—	—	—	—	—	—
25 Qwen3 235B 2507 Alibaba Self-host	Open	Standard	128K	$0.00 / $0.00	Not listed	Not listed	72.1%	Est.53	39%P	77.5%	—	62.6%	83%	1%P	39%P	—	54.3%P	—	—	—	—	—	—

Showing 25 of 52
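The cells in the table above mix three kinds of value: a plain percentage, a percentage with a trailing P (a provisional row), and a dash for missing evidence. The sketch below shows one way to read such cells and decide whether a cell counts in each ranking mode. The names `Cell`, `parse_cell`, and `counts_in_mode` are hypothetical, and the rule that verified-ranked mode simply drops P rows is an assumption drawn from the mode descriptions on this page, not BenchLM's published implementation.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    value: float       # benchmark score as a percentage
    provisional: bool  # True when the cell carried a trailing "P"


def parse_cell(text: str) -> Optional[Cell]:
    """Parse a leaderboard cell such as '91.3%', '99%P', or '—'."""
    text = text.strip()
    if text in {"", "—"}:
        return None  # no evidence reported for this benchmark
    provisional = text.endswith("P")
    if provisional:
        text = text[:-1]
    return Cell(float(text.rstrip("%")), provisional)


def counts_in_mode(cell: Optional[Cell], mode: str) -> bool:
    """Assumed rule: 'verified' keeps only non-provisional cells."""
    if cell is None:
        return False
    if mode == "provisional":
        return True
    if mode == "verified":
        return not cell.provisional
    raise ValueError(f"unknown mode: {mode!r}")


# Cells taken from the Claude Opus 4.6 row: MMLU, GPQA, and an empty column.
mmlu = parse_cell("99%P")
gpqa = parse_cell("91.3%")
missing = parse_cell("—")
```

Under this reading, the MMLU cell counts only in provisional-ranked mode, the GPQA cell counts in both modes, and the empty cell counts in neither.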

Top AI Models for Knowledge — July 2026

As of July 2026, Muse Spark 1.1 leads the provisional knowledge leaderboard with a score of 93.5%, followed by Claude Opus 4.8 (92.5%) and Claude Sonnet 5 (91.2%). BenchLM is currently showing 52 provisional-ranked models and 27 verified-ranked models in this category.

Rank	Model	Weighted score	Evidence and fit
1	Muse Spark 1.1 (Meta · Proprietary)	93.5%	HLE 62.1
2	Claude Opus 4.8 (Anthropic · Proprietary)	92.5%	GPQA 93.6, HLE 57.9
3	Claude Sonnet 5 (Anthropic · Proprietary)	91.2%	HLE 57.4

What changed

Claude Mythos Preview leads knowledge with the strongest HLE and FrontierScience scores.

GPT-5.4 is a close second, with excellent GPQA Diamond scores.

Claude Opus 4.6 holds #3, strong on SuperGPQA and SimpleQA factual accuracy.

How to choose

Research-level scientific Q&A? Claude Mythos Preview (best HLE scores)
Graduate-level STEM questions? GPT-5.4 (leads on GPQA Diamond)
Factual accuracy matters most? Claude Opus 4.6 (best SimpleQA score)
Broad knowledge on a budget? Gemini 3.1 Pro (strong MMLU-Pro at low cost)

Top models by benchmark

GPQA: Expert-level questions in biology, physics, and chemistry (7% of category score)

Rank	Model	Reported score
1	Sakana Fugu-Ultra	95.5
2	Sakana Fugu	95.5
3	GPT-5.6 Sol	94.6
4	Claude Opus 4.7 (Adaptive)	94.2
5	Claude Mythos 5	94.1

Score in Context

What these scores mean

Knowledge carries a 12% weight in overall scoring. The weighted score blends expert-level tests (HLE, GPQA, SuperGPQA) with broad knowledge (MMLU-Pro) and factual accuracy (SimpleQA). A model scoring 90+ on MMLU might still struggle with research-level scientific reasoning: broad knowledge doesn't guarantee deep expertise.

Known limitations

MMLU is saturated: top models score 90%+, which makes it poor at differentiating them. HLE ("Humanity's Last Exam") is deliberately very hard, so even the leaders in the table above mostly score between 40% and 65%. Small score differences on HLE are noisy. SimpleQA measures factual accuracy but can penalize models that hedge appropriately.

How we weight

Knowledge carries a 12% weight in BenchLM.ai's overall scoring. Within the category, five benchmarks are weighted: HLE (45%), MMLU-Pro (30%), SimpleQA (11%), GPQA (7%), and SuperGPQA (7%). The other ten tracked benchmarks are display only.

For tasks like research assistance, factual Q&A, content creation, and educational applications, knowledge benchmark scores remain one of the strongest predictive signals. See the knowledge leaderboard for the top models in this category.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
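The fallback described above can be illustrated with a small sketch. It assumes the simplest possible reading: drop generated rows, then take a weighted average over whatever weighted benchmarks remain, renormalizing the weights so they still sum to one. That reading is an assumption, and the sketch does not attempt to reproduce BenchLM's published category scores; the actual rules are on the methodology page. The function name `category_score` and the input values are hypothetical, while the weights come from the weights table on this page.

```python
from typing import Dict, Optional, Tuple

# Category weights in percent, copied from the benchmark weights table.
WEIGHTS: Dict[str, int] = {
    "HLE": 45,
    "MMLU-Pro": 30,
    "SimpleQA": 11,
    "GPQA": 7,
    "SuperGPQA": 7,
}


def category_score(
    rows: Dict[str, Tuple[float, bool]],
    weights: Dict[str, int] = WEIGHTS,
) -> Optional[float]:
    """Weighted average over usable rows (assumed rule, not BenchLM's formula).

    rows maps a benchmark name to (score, is_generated). Generated rows and
    display-only benchmarks are ignored; nothing is filled in for gaps.
    """
    usable = {
        name: score
        for name, (score, is_generated) in rows.items()
        if name in weights and not is_generated
    }
    if not usable:
        return None  # no trustworthy weighted evidence at all
    total_weight = sum(weights[name] for name in usable)
    return sum(weights[name] * score for name, score in usable.items()) / total_weight


# Hypothetical model: the GPQA row is generated, SimpleQA and SuperGPQA are missing.
example_rows = {
    "HLE": (50.0, False),
    "MMLU-Pro": (80.0, False),
    "GPQA": (90.0, True),
    "MMLU": (99.0, False),  # display only, so it carries no weight
}
example_score = category_score(example_rows)  # (45*50 + 30*80) / 75 = 62.0
```

In this example only HLE and MMLU-Pro contribute, so the generated GPQA value never raises the score and the missing benchmarks are not replaced with synthetic values.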

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.


Knowledge benchmark weights, ranking status, and descriptions
Benchmark	Weight	Status	Description
MMLU	—	Display only	Tests knowledge across 57 academic subjects
GPQA	7%	Weighted	Expert-level questions in biology, physics, and chemistry
GPQA-D	—	Display only	Provider-table reference for GPQA Diamond scores reported in first-party comparison charts.
SuperGPQA	7%	Weighted	Enhanced version covering 285 disciplines
MMLU-Pro	30%	Weighted	Harder version of MMLU with 10 answer choices and more reasoning-focused questions
HLE	45%	Weighted	Extremely difficult questions contributed by domain experts worldwide to test frontier AI
FrontierScience	—	Display only	Research-level science and scientific reasoning benchmark
HLE w/o tools	—	Display only	Tool-free variant of Humanity's Last Exam used to isolate raw frontier reasoning without external aids
SimpleQA	11%	Weighted	Factual question answering benchmark
HealthBench Hard	—	Display only	A harder health reasoning benchmark subset used in first-party frontier model comparisons.
HealthBench Professional	—	Display only	An open benchmark for clinician-facing model responses across care consult, writing and documentation, and medical research tasks.
MedXpertQA (Text)	—	Display only	Medical multiple-choice benchmark covering many specialties with text-only questions.
FrontierScience Research	—	Display only	A research-oriented FrontierScience variant focused on scientific investigation and solution quality.
MMLU-Pro (Arcee)	—	Display only	Display-only MMLU-Pro reference from Arcee AI's Trinity-Large-Thinking launch chart.
MMMLU	—	Display only	A multilingual MMLU-style benchmark reported in provider evaluation tables.
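As a quick consistency check on the table above, the sketch below loads the Benchmark, Weight, and Status columns as tab-separated text, separates weighted benchmarks from display-only ones, and confirms that the weights add up to 100%. The helper name `load_weights` is hypothetical, and the descriptions column is omitted for brevity.

```python
import csv
import io
from typing import Dict, List, Tuple

# Benchmark, weight, and status columns copied from the table above.
WEIGHTS_TABLE = (
    "Benchmark\tWeight\tStatus\n"
    "MMLU\t—\tDisplay only\n"
    "GPQA\t7%\tWeighted\n"
    "GPQA-D\t—\tDisplay only\n"
    "SuperGPQA\t7%\tWeighted\n"
    "MMLU-Pro\t30%\tWeighted\n"
    "HLE\t45%\tWeighted\n"
    "FrontierScience\t—\tDisplay only\n"
    "HLE w/o tools\t—\tDisplay only\n"
    "SimpleQA\t11%\tWeighted\n"
    "HealthBench Hard\t—\tDisplay only\n"
    "HealthBench Professional\t—\tDisplay only\n"
    "MedXpertQA (Text)\t—\tDisplay only\n"
    "FrontierScience Research\t—\tDisplay only\n"
    "MMLU-Pro (Arcee)\t—\tDisplay only\n"
    "MMMLU\t—\tDisplay only\n"
)


def load_weights(table: str) -> Tuple[Dict[str, int], List[str]]:
    """Split the table into {weighted benchmark: percent} and display-only names."""
    weights: Dict[str, int] = {}
    display_only: List[str] = []
    for row in csv.DictReader(io.StringIO(table), delimiter="\t"):
        if row["Status"] == "Weighted":
            weights[row["Benchmark"]] = int(row["Weight"].rstrip("%"))
        else:
            display_only.append(row["Benchmark"])
    return weights, display_only


weights, display_only = load_weights(WEIGHTS_TABLE)
total = sum(weights.values())
```

With the values in the table, five benchmarks carry weight, ten are display only, and the weights total 100.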

About Knowledge Benchmarks

Knowledge benchmarks measure general knowledge and factual understanding. MMLU, for example, tests knowledge across 57 academic subjects.

Common questions

What is the best LLM for knowledge tasks?

The top LLMs for knowledge tasks are ranked by benchmarks like MMLU and GPQA, which test factual accuracy and expert-level understanding across dozens of subjects.

What is MMLU and how does it measure LLM knowledge?

MMLU (Massive Multitask Language Understanding) tests LLMs across 57 subjects from STEM to humanities, measuring broad factual knowledge and reasoning at varying difficulty levels.

What benchmarks test knowledge in AI models?

Key knowledge benchmarks include MMLU, MMLU-Pro, GPQA, SuperGPQA, HLE, and FrontierScience, each evaluating different depths of factual and scientific understanding.

How do knowledge benchmarks differ from reasoning benchmarks?

Knowledge benchmarks focus on factual recall and domain expertise, while reasoning benchmarks test logical deduction and multi-step problem solving independent of specific facts.
