
GDPval-AA

An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Benchmark score on GDPval-AA — June 2, 2026

BenchLM mirrors the published score view for GDPval-AA. Claude Opus 4.8 leads the public snapshot at 1890, followed by GPT-5.5 (1769) and Claude Opus 4.7 (Adaptive) (1753). BenchLM does not use these results to rank models overall.

114 models · Agentic · Current · Display only · Updated June 2, 2026

The published GDPval-AA snapshot is fairly compact at the top: Claude Opus 4.8 leads at 1890, GPT-5.5 follows 121 points back at 1769, and the third row, Claude Opus 4.7 (Adaptive) at 1753, is 137 points behind the leader. The broader top-10 spread is 319 points (1890 down to 1571), so the benchmark still separates strong models even when the leaders cluster.
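The gap and spread figures quoted above are plain differences between rows of the score table. A minimal sketch that recomputes them, using the top-10 scores copied from the table on this page (the helper name `gap` is illustrative, not part of any BenchLM tooling):

```python
# Top-10 Elo scores copied from the published GDPval-AA table (ranks 1-10).
top10 = [1890, 1769, 1753, 1677, 1674, 1656, 1619, 1599, 1591, 1571]


def gap(scores, i, j):
    """Point difference between the rows at 1-based ranks i and j."""
    return scores[i - 1] - scores[j - 1]


leader_to_third = gap(top10, 1, 3)   # leader vs. third row
top10_spread = gap(top10, 1, 10)     # leader vs. tenth row

print(leader_to_third)  # 137
print(top10_spread)     # 319
```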

114 models have been evaluated on GDPval-AA. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. GDPval-AA is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About GDPval-AA

Year: 2026
Tasks: Agentic real-world work tasks
Format: Elo
Difficulty: Professional agentic workflows

BenchLM stores GDPval-AA as a display-only provider-table row for DeepSeek-V4 because the source reports an Elo score rather than a 0-100 percentage.
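An Elo score is a relative rating, not a fraction of tasks solved, which is why it does not map onto a 0-100 scale. The sketch below illustrates this with the conventional logistic Elo formula and its 400-point scale. The page does not state how GDPval-AA computes its ratings, so both the formula and the scale are assumptions, and `elo_expected_score` is a hypothetical helper name.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected head-to-head score of A against B under conventional Elo.

    Assumption: the standard logistic curve with a 400-point scale.
    GDPval-AA's actual rating procedure is not described on this page.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Leader (1890) versus second place (1769), scores taken from the table.
p_leader = elo_expected_score(1890, 1769)
print(round(p_leader, 2))  # roughly 0.67 under the assumed formula
```

Under that assumption, the 121-point lead corresponds to an expected pairwise preference of about two in three, which says nothing about an absolute pass rate.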

BenchLM freshness & provenance

Version: GDPval-AA 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (114 models)

Rank  Score
1     1890
2     1769
3     1753
4     1677
5     1674
6     1656
7     1619
8     1599
9     1591
10    1571
11    1558
12    1554
13    1546
14    1505
15    1504
16    1497
17    1495
18    1482
19    1481
20    1467
21    1452
22    1438
23    1419
24    1417
25    1414
26    1406
27    1403
28    1395
29    1388
30    1354
31    1330
32    1324
33    1319
34    1314
35    1299
36    1298
37    1294
38    1288
39    1285
40    1285
41    1238
42    1227
43    1220
44    1194
45    1192
46    1192
47    1190
48    1185
49    1184
50    1168
51    1160
52    1125
53    1116
54    1116
55    1113
56    1080
57    1059
59    1038
60    1014
62    1000
63    989
64    985
65    947
66    928
67    919
68    919
69    906
70    877
71    865
72    865
73    864
75    861
76    825
77    783
78    783
79    776
81    762
82    753
83    741
84    738
85    736
86    681
87    649
88    620
89    614
90    586
91    559
92    526
93    445
94    435
95    410
96    387
97    378
98    357
99    348
100   346
101   328
102   323
103   318
104   302
105   294
106   289
107   285
108   270
109   269
110   266
111   265
112   258
113   255
114   238

FAQ

What does GDPval-AA measure?

An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Which model scores highest on GDPval-AA?

Claude Opus 4.8 by Anthropic currently leads with a score of 1890 on GDPval-AA.

How many models are evaluated on GDPval-AA?

114 AI models have been evaluated on GDPval-AA on BenchLM.

Last updated: June 2, 2026 · BenchLM version GDPval-AA 2026
