Artificial Analysis Humanity's Last Exam (AA-HLE)

A display-only Artificial Analysis Humanity's Last Exam score.

Benchmark score on AA-HLE — June 2, 2026

BenchLM mirrors the published score view for AA-HLE. Claude Opus 4.8 leads the public snapshot at 45.7%, followed by Gemini 3.1 Pro (44.7%) and GPT-5.5 (44.3%). BenchLM does not use these results to rank models overall.

124 models · Knowledge · Current · Display only · Updated June 2, 2026

The published AA-HLE snapshot is tightly clustered at the top: Claude Opus 4.8 sits at 45.7%, while the third row is only 1.4 points behind. The broader top-10 spread is 8.5 points, so many of the published scores sit in a relatively narrow band.
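The two gaps quoted above can be checked directly from the top ten published scores. The following is a minimal sketch, assuming the scores are copied by hand from the table on this page; the `gap` helper and the variable names are illustrative and not part of any BenchLM tooling.

```python
# Top-10 AA-HLE scores (percent), copied from the published table on this page.
top10 = [45.7, 44.7, 44.3, 41.6, 41.0, 39.9, 39.9, 39.6, 38.1, 37.2]


def gap(scores, i, j):
    """Percentage-point difference between row i and row j (1-indexed ranks)."""
    # Round to one decimal place to match the precision of the published scores.
    return round(scores[i - 1] - scores[j - 1], 1)


lead_over_third = gap(top10, 1, 3)   # first row vs. third row
top10_spread = gap(top10, 1, 10)     # first row vs. tenth row

print(lead_over_third, top10_spread)
```

Both values are differences in percentage points, not relative percentages, which is why the leader's margin looks small next to its absolute score.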

124 models have been evaluated on AA-HLE. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-HLE is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About AA-HLE

Year: 2026
Tasks: Expert-level questions
Format: Accuracy
Difficulty: Frontier expert reasoning

BenchLM stores the Artificial Analysis HLE result separately from the weighted HLE lane so AA refreshes remain display-only.

BenchLM freshness & provenance

Version: AA-HLE 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (124 models)

Rank  Score
1     45.7%
2     44.7%
3     44.3%
4     41.6%
5     41.0%
6     39.9%
7     39.9%
8     39.6%
9     38.1%
10    37.2%
11    36.7%
12    35.9%
13    35.9%
14    35.4%
15    35.0%
16    33.8%
17    33.5%
18    33.5%
19    32.1%
20    31.2%
21    29.4%
22    29.4%
23    28.9%
24    28.4%
25    28.3%
26    28.1%
27    28.0%
28    27.8%
29    27.3%
30    27.2%
31    26.6%
32    26.5%
33    26.5%
34    26.5%
35    25.7%
36    25.5%
37    25.4%
38    25.1%
39    23.9%
40    23.5%
41    23.4%
42    23.4%
43    23.4%
44    22.7%
45    22.2%
46    21.6%
47    21.1%
48    20.2%
49    20.0%
50    19.9%
51    19.9%
52    19.7%
53    18.8%
54    18.6%
55    18.5%
56    18.3%
58    17.0%
59    16.2%
60    15.8%
61    14.9%
62    14.7%
63    14.7%
64    14.1%
65    13.2%
66    13.1%
67    13.0%
68    12.9%
69    12.8%
70    11.9%
71    11.4%
72    11.1%
73    10.5%
74    10.1%
75    9.8%
77    9.5%
78    8.7%
79    8.1%
80    8.0%
81    7.7%
82    7.5%
83    7.0%
84    7.0%
85    6.8%
86    6.4%
87    6.3%
88    6.2%
89    5.8%
90    5.7%
93    5.2%
94    5.1%
95    5.1%
96    5.0%
97    5.0%
98    4.9%
99    4.9%
100   4.8%
101   4.8%
102   4.7%
103   4.6%
104   4.6%
105   4.6%
106   4.6%
107   4.3%
108   4.3%
109   4.2%
110   4.1%
111   4.1%
112   4.0%
113   4.0%
114   4.0%
115   3.9%
116   3.9%
117   3.8%
118   3.8%
119   3.7%
120   3.6%
121   3.4%
122   3.3%
123   3.3%
124   3.1%

FAQ

What does AA-HLE measure?

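AA-HLE is the Artificial Analysis run of Humanity's Last Exam. It reports accuracy on expert-level questions that call for frontier expert reasoning. BenchLM shows the score for reference only.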

Which model scores highest on AA-HLE?

Claude Opus 4.8 by Anthropic currently leads with a score of 45.7% on AA-HLE.

How many models are evaluated on AA-HLE?

BenchLM lists 124 AI models evaluated on AA-HLE.

Last updated: June 2, 2026 · BenchLM version AA-HLE 2026
