Benchmark profile

Artificial Analysis Humanity's Last Exam (AA-HLE)

A display-only Artificial Analysis Humanity's Last Exam score.

Data verified July 23, 2026

Benchmark score on AA-HLE — July 23, 2026

BenchLM mirrors the published score view for AA-HLE. Claude Fable 5 leads the public snapshot at 53.3%, followed by GPT-5.6 Sol (47.2%) and Claude Opus 4.8 (45.7%). BenchLM does not use these results to rank models overall.

1 · Claude Fable 5 · Anthropic · Closed
claude-fable-5
AA-HLE 53.3% · Overall 83.68 · Context 1M+

2 · GPT-5.6 Sol · OpenAI · Closed
gpt-5-6-sol
AA-HLE 47.2% · Overall 81.96 · Context 1M

3 · Claude Opus 4.8 · Anthropic · Closed
claude-opus-4-8
AA-HLE 45.7% · Overall 78.34 · Context 1M

162 models · Knowledge · Current · Display only · Updated July 23, 2026

Benchmark score table (162 models)

Rank · Model · Provider · Weights · Score

1 · Claude Fable 5 · Anthropic · Closed · 53.3%
2 · GPT-5.6 Sol · OpenAI · Closed · 47.2%
3 · Claude Opus 4.8 · Anthropic · Closed · 45.7%
4 · Muse Spark 1.1 · Meta · Closed · 45.1%
5 · Gemini 3.1 Pro · Google · Closed · 44.7%
6 · Kimi K3 · Moonshot AI · Closed · 44.3%
7 · GPT-5.5 · OpenAI · Closed · 44.3%
8 · GPT-5.6 Terra · OpenAI · Closed · 41.8%
9 · GPT-5.4 · OpenAI · Closed · 41.6%
10 · Gemini 3.5 Flash · Google · Closed · 41.0%
11 · Grok 4.5 · xAI · Closed · 40.3%
12 · GLM-5.2 · Z.AI · Open weight · 40.1%
13 · GPT-5.3 Codex · OpenAI · Closed · 39.9%
14 · Muse Spark · Meta · Closed · 39.9%
15 · GPT-5.3-Codex-Spark · OpenAI · Closed · 39.9%
16 · Claude Sonnet 5 · Anthropic · Closed · 39.6%
17 · Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 39.6%
18 · Gemini 3.6 Flash · Google · Closed · 38.3%
19 · Qwen3.7 Max · Alibaba · Closed · 38.1%
20 · GPT-5.6 Luna · OpenAI · Closed · 37.2%
21 · Gemini 3 Pro · Google · Closed · 37.2%
22 · MiniMax M3 · MiniMax · Open weight · 37.1%
23 · Claude Opus 4.6 (Adaptive) · Anthropic · Closed · 36.7%
24 · DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 35.9%
25 · Kimi K2.6 · Moonshot AI · Open weight · 35.9%
26 · GPT-5.2 · OpenAI · Closed · 35.4%
27 · Grok 4.3 · xAI · Closed · 35.0%
28 · MiMo-V2.5-Pro · Xiaomi · Closed · 33.8%
29 · DeepSeek V4 Pro (High) · DeepSeek · Open weight · 33.5%
30 · GPT-5.2-Codex · OpenAI · Closed · 33.5%
31 · Qwen3.7 Plus · Alibaba · Closed · 33.4%
32 · Kimi K2.7 Code · Moonshot AI · Open weight · 32.8%
33 · DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 32.1%
34 · Hy3 Preview · Tencent · Open weight · 31.6%
35 · Hy3 · Tencent · Open weight · 31.6%
36 · Claude Opus 4.7 · Anthropic · Closed · 31.2%
37 · Inkling · Thinking Machines Lab · Open weight · 29.7%
38 · Kimi K2.5 · Moonshot AI · Open weight · 29.4%
39 · Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 29.4%
40 · Qwen 3.6 Max (preview) · Alibaba · Closed · 28.9%
41 · Claude Opus 4.5 Thinking · Anthropic · Closed · 28.4%
42 · MiMo-V2-Pro · Xiaomi · Closed · 28.3%
43 · MiniMax M2.7 · MiniMax · Open weight · 28.1%
44 · GLM-5.1 · Z.AI · Open weight · 28.0%
45 · DeepSeek V4 Flash (High) · DeepSeek · Open weight · 27.8%
46 · Qwen3.5 397B · Alibaba · Open weight · 27.3%
47 · Qwen3.5 397B (Reasoning) · Alibaba · Open weight · 27.3%
48 · GLM-5 · Z.AI · Open weight · 27.2%
49 · GPT-5.4 mini · OpenAI · Closed · 26.6%
50 · Nemotron 3 Ultra · NVIDIA · Open weight · 26.6%
51 · GPT-5.4 nano · OpenAI · Closed · 26.5%
52 · GPT-5.1 · OpenAI · Closed · 26.5%
53 · GPT-5 (high) · OpenAI · Closed · 26.5%
54 · Qwen3.6 Plus · Alibaba · Closed · 25.7%
55 · GLM-5-Turbo · Z.AI · Closed · 25.4%
56 · GLM-4.7 · Z.AI · Open weight · 25.1%
57 · Grok 4 · xAI · Closed · 23.9%
58 · GPT-5 (medium) · OpenAI · Closed · 23.5%
59 · Qwen3.5-122B-A10B · Alibaba · Open weight · 23.4%
60 · GPT-5.1-Codex-Max · OpenAI · Closed · 23.4%
61 · GPT-5.1-Codex · OpenAI · Closed · 23.4%
62 · Gemma 4 31B · Google · Open weight · 22.7%
63 · Step 3.5 Flash · StepFun · Open weight · 22.6%
64 · Qwen3.5-27B · Alibaba · Open weight · 22.2%
65 · Qwen3.6-27B · Alibaba · Open weight · 21.6%
66 · Gemini 2.5 Pro · Google · Closed · 21.1%
67 · Qwen3.6-35B-A3B · Alibaba · Open weight · 20.2%
68 · o3 · OpenAI · Closed · 20.0%
69 · Step 3.7 Flash · StepFun · Open weight · 19.9%
70 · MiMo-V2-Omni · Xiaomi · Closed · 19.9%
71 · Qwen3.5-35B-A3B · Alibaba · Open weight · 19.7%
72 · GPT-5 mini · OpenAI · Closed · 19.7%
73 · Nemotron 3 Super 120B A12B · NVIDIA · Open weight · 19.2%
74 · MiniMax M2.5 · MiniMax · Closed · 19.1%
75 · Claude Opus 4.6 · Anthropic · Closed · 18.6%
76 · GPT-OSS 120B · OpenAI · Open weight · 18.5%
77 · Gemma 4 26B A4B · Google · Open weight · 18.3%
78 · Grok 4.1 Fast (Reasoning) · xAI · Closed · 17.6%
79 · Gemini 3.5 Flash-Lite · Google · Closed · 17.5%
80 · Grok 4 Fast (Reasoning) · xAI · Closed · 17.0%
81 · Gemini 3.1 Flash-Lite · Google · Closed · 16.2%
82 · GLM-5V-Turbo · Z.AI · Closed · 15.8%
83 · Mercury 2 · Inception · Closed · 15.5%
84 · DeepSeek-R1 · DeepSeek · Open weight · 14.9%
85 · Gemma 4 12B · Google · Open weight · 14.8%
86 · Trinity-Large-Preview · Arcee AI · Open weight · 14.7%
87 · Trinity-Large-Thinking · Arcee AI · Open weight · 14.7%
88 · Gemini 3 Flash · Google · Closed · 14.1%
89 · Claude Sonnet 4.6 · Anthropic · Closed · 13.2%
90 · K-Exaone · LG AI Research · Closed · 13.1%
91 · DeepSeek V3.1 (Reasoning) · DeepSeek · Open weight · 13.0%
92 · Claude Opus 4.5 · Anthropic · Closed · 12.9%
93 · Mistral Medium 3.5 128B · Mistral · Open weight · 12.8%
94 · Claude 4.1 Opus Thinking · Anthropic · Closed · 11.9%
95 · Command A+ · Cohere · Open weight · 11.4%
96 · Qwen3 Max · Alibaba · Closed · 11.1%
97 · DeepSeek V3.2 · DeepSeek · Open weight · 10.5%
98 · Nemotron 3 Nano 30B · NVIDIA · Open weight · 10.2%
99 · Sarvam 105B · Sarvam · Open weight · 10.1%
100 · GPT-OSS 20B · OpenAI · Open weight · 9.8%
101 · Mistral Small 4 · Mistral · Open weight · 9.5%
102 · Mistral Small 4 (Reasoning) · Mistral · Open weight · 9.5%
103 · o3-mini · OpenAI · Closed · 8.7%
104 · GPT-5 nano · OpenAI · Closed · 8.2%
105 · MiniMax M1 80k · MiniMax · Closed · 8.2%
106 · Nemotron Ultra 253B · NVIDIA · Open weight · 8.1%
107 · MiMo-V2-Flash · Xiaomi · Open weight · 8.0%
108 · o1 · OpenAI · Closed · 7.7%
109 · Grok Code Fast 1 · xAI · Closed · 7.5%
110 · GLM-4.7-Flash · Z.AI · Open weight · 7.1%
111 · Kimi K2 · Moonshot AI · Closed · 7.0%
112 · Sarvam 30B · Sarvam · Open weight · 7.0%
113 · LFM2.5-8B-A1B · LiquidAI · Open weight · 6.9%
114 · GLM-4.5-Air · Z.AI · Closed · 6.8%
115 · LFM2.5-1.2B-Instruct · LiquidAI · Closed · 6.8%
116 · Granite-4.0-H-350M · IBM · Open weight · 6.4%
117 · DeepSeek V3.1 · DeepSeek · Open weight · 6.3%
118 · Ling 2.6 Flash · InclusionAI · Open weight · 6.2%
119 · LFM2.5-1.2B-Thinking · LiquidAI · Closed · 6.1%
120 · Exaone 4.0 1.2B · LG AI Research · Open weight · 5.8%
121 · Granite-4.0-350M · IBM · Open weight · 5.7%
122 · DeepSeek R1 Distill Qwen 32B · DeepSeek · Open weight · 5.5%
123 · Nemotron 3 Nano Omni 30B A3B · NVIDIA · Open weight · 5.3%
124 · Ministral 3 3B (Reasoning) · Mistral · Open weight · 5.3%
125 · Ministral 3 3B · Mistral · Open weight · 5.3%
126 · GLM-4.6 · Z.AI · Open weight · 5.2%
127 · Gemini 2.5 Flash · Google · Closed · 5.1%
128 · LFM2.5-VL-1.6B-Extract · LiquidAI · Open weight · 5.1%
129 · Granite-4.0-1B · IBM · Open weight · 5.1%
130 · Grok 4.1 Fast · xAI · Closed · 5.0%
131 · Granite-4.0-H-1B · IBM · Open weight · 5.0%
132 · Exaone 4.0 32B · LG AI Research · Open weight · 4.9%
133 · Gemini 1.5 Pro · Google · Closed · 4.9%
134 · Llama 4 Maverick · Meta · Open weight · 4.8%
135 · Gemma 4 E2B · Google · Open weight · 4.8%
136 · Gemma 3 27B · Google · Open weight · 4.7%
137 · GPT-4.1 · OpenAI · Closed · 4.6%
138 · GPT-4.1 mini · OpenAI · Closed · 4.6%
139 · Gemini 1.0 Pro · Google · Closed · 4.6%
140 · Ministral 3 14B (Reasoning) · Mistral · Open weight · 4.6%
141 · Ministral 3 14B · Mistral · Open weight · 4.6%
142 · LFM2-24B-A2B · LiquidAI · Closed · 4.4%
143 · Llama 4 Scout · Meta · Open weight · 4.3%
144 · Mistral Medium 3 · Mistral · Closed · 4.3%
145 · Ministral 3 8B (Reasoning) · Mistral · Open weight · 4.3%
146 · Ministral 3 8B · Mistral · Open weight · 4.3%
147 · Llama 3.1 405B · Meta · Open weight · 4.2%
148 · Mistral Large 3 · Mistral · Closed · 4.1%
149 · Phi-4 · Microsoft · Open weight · 4.1%
150 · Claude 4 Sonnet · Anthropic · Closed · 4.0%
151 · GPT-4o mini · OpenAI · Closed · 4.0%
152 · Mistral Large 2 · Mistral · Closed · 4.0%
153 · GPT-4.1 nano · OpenAI · Closed · 3.9%
154 · Claude 3 Haiku · Anthropic · Closed · 3.9%
155 · Qwen2.5 Coder 32B Instruct · Alibaba · Open weight · 3.8%
156 · Solar Pro 2 · Upstage · Closed · 3.8%
157 · Gemma 4 E4B · Google · Open weight · 3.7%
158 · DeepSeek V3 · DeepSeek · Open weight · 3.6%
159 · Nova Pro · Amazon · Closed · 3.4%
160 · GPT-4o · OpenAI · Closed · 3.3%
161 · GPT-4 Turbo · OpenAI · Closed · 3.3%
162 · Claude 3 Opus · Anthropic · Closed · 3.1%

The published AA-HLE snapshot places Claude Fable 5 first at 53.3%. Third-place Claude Opus 4.8 trails by 7.6 points, and the top 10 span 12.3 points (from 53.3% down to 41.0% for Gemini 3.5 Flash), so the scores still separate the leading systems.
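The two gaps quoted above can be reproduced from the table with a few lines of Python. This is only a sketch of the arithmetic: the scores are copied by hand from the top ten rows of this page, and the variable names are illustrative rather than anything BenchLM publishes.

```python
# Top-10 AA-HLE scores (percent), copied from the score table on this page.
top10 = [
    ("Claude Fable 5", 53.3),
    ("GPT-5.6 Sol", 47.2),
    ("Claude Opus 4.8", 45.7),
    ("Muse Spark 1.1", 45.1),
    ("Gemini 3.1 Pro", 44.7),
    ("Kimi K3", 44.3),
    ("GPT-5.5", 44.3),
    ("GPT-5.6 Terra", 41.8),
    ("GPT-5.4", 41.6),
    ("Gemini 3.5 Flash", 41.0),
]

leader_score = top10[0][1]

# Gap between the leader and the third row, in percentage points.
third_gap = round(leader_score - top10[2][1], 1)

# Spread between the first and tenth rows, in percentage points.
top10_range = round(leader_score - top10[-1][1], 1)

print(f"Leader vs third place: {third_gap} points")
print(f"Top-10 range: {top10_range} points")
```

Rounding to one decimal place matches the precision of the published scores and avoids floating-point noise in the subtraction.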

162 models have been evaluated on AA-HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-HLE is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
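One way to picture "displayed but excluded" is a weighted category average that skips any benchmark flagged as display-only. The sketch below is an assumption-heavy illustration, not BenchLM's actual formula: the weighted-mean aggregation, the `Benchmark` class, the `display_only` flag, the second category, its 88% weight, and both placeholder benchmarks are hypothetical. Only the 12% Knowledge weight and the 53.3% AA-HLE score come from this page.

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    category: str
    score: float            # percent, 0-100
    display_only: bool = False


def category_score(benchmarks, category):
    """Mean of the scored (non-display-only) benchmarks in one category."""
    scored = [b.score for b in benchmarks
              if b.category == category and not b.display_only]
    return sum(scored) / len(scored) if scored else None


def overall_score(benchmarks, weights):
    """Weighted mean of category scores; empty categories are skipped."""
    total, weight_used = 0.0, 0.0
    for category, weight in weights.items():
        score = category_score(benchmarks, category)
        if score is None:
            continue
        total += weight * score
        weight_used += weight
    return total / weight_used if weight_used else None


# Knowledge weight is from this page; "Everything else" is a hypothetical stand-in.
weights = {"Knowledge": 0.12, "Everything else": 0.88}

benchmarks = [
    Benchmark("AA-HLE", "Knowledge", 53.3, display_only=True),
    Benchmark("placeholder-knowledge-bench", "Knowledge", 80.0),   # hypothetical
    Benchmark("placeholder-other-bench", "Everything else", 90.0), # hypothetical
]

with_aa_hle = overall_score(benchmarks, weights)

# Changing the display-only score leaves the overall result untouched.
benchmarks[0].score = 0.0
with_aa_hle_zeroed = overall_score(benchmarks, weights)

print(with_aa_hle, with_aa_hle_zeroed)
```

Under these assumptions the AA-HLE value never enters the sum, which is the behaviour the paragraph above describes: the score is shown, but it cannot move a model's overall ranking.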

About AA-HLE

Year: 2026

Tasks: Expert-level questions

Format: Accuracy

Difficulty: Frontier expert reasoning

BenchLM stores the Artificial Analysis HLE result separately from the weighted HLE lane so AA refreshes remain display-only.

Artificial Analysis Humanity's Last Exam Benchmark Leaderboard

BenchLM freshness & provenance

Version: AA-HLE 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Status: Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does AA-HLE measure?

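AA-HLE is the Artificial Analysis Humanity's Last Exam score: accuracy on expert-level questions at frontier expert reasoning difficulty. BenchLM shows it for display only.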

Which model scores highest on AA-HLE?

Claude Fable 5 by Anthropic currently leads with a score of 53.3% on AA-HLE.

How many models are evaluated on AA-HLE?

BenchLM lists AA-HLE scores for 162 AI models.

Compare Top Models on AA-HLE

Claude Fable 5 vs GPT-5.6 Sol · GPT-5.6 Sol vs Claude Opus 4.8 · Claude Opus 4.8 vs Muse Spark 1.1 · Muse Spark 1.1 vs Gemini 3.1 Pro

Last updated: July 23, 2026 · BenchLM version AA-HLE 2026
