GDPval-AA normalized (GDPval-AA)

A display-only Artificial Analysis normalized score for economically valuable tasks.

Benchmark score on GDPval-AA — June 2, 2026

BenchLM mirrors the published score view for GDPval-AA. Claude Opus 4.8 leads the public snapshot at 69.5%, followed by GPT-5.5 (63.5%) and Claude Opus 4.7 (Adaptive) (62.6%). BenchLM does not use these results to rank models overall.

115 models · Agentic · Current · Display only · Updated June 2, 2026

The published GDPval-AA snapshot is tightly clustered at the top: Claude Opus 4.8 leads at 69.5%, and the third-ranked model, Claude Opus 4.7 (Adaptive) at 62.6%, is only 6.9 points behind. The top-10 spread is wider at 15.9 points (69.5% down to 53.6%), so the benchmark still separates strong models even when the leaders cluster.
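The two gaps quoted above can be checked directly against the score table. The snippet below is a minimal sketch: the ten scores are copied from the top of the published table, while the helper name `gap` is only an illustrative choice and not part of any BenchLM tooling.

```python
# Top-10 GDPval-AA scores (percent), copied from the published table.
top10 = [69.5, 63.5, 62.6, 58.9, 58.7, 57.8, 55.9, 54.9, 54.5, 53.6]


def gap(scores, high_rank, low_rank):
    """Point difference between two 1-indexed ranks, rounded to one decimal."""
    return round(scores[high_rank - 1] - scores[low_rank - 1], 1)


leader_to_third = gap(top10, 1, 3)   # rank 1 minus rank 3
top10_spread = gap(top10, 1, 10)     # rank 1 minus rank 10

print(leader_to_third, top10_spread)
```

Rounding to one decimal matches the precision of the published scores and avoids floating-point noise in the subtraction.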

115 models have been evaluated on GDPval-AA. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. GDPval-AA itself is displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.
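To make the "display only" distinction concrete, here is a hedged sketch of how a category-weighted score could skip display-only benchmarks. Only the 22% Agentic weight comes from this page. The `BenchmarkResult` type, the `overall_score` function, the placeholder "Other" category, the plain per-category average, and the second and third benchmark names are all assumptions made for illustration; BenchLM's actual formula is described on its methodology page and may differ.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    category: str
    score: float               # normalized score, 0-100
    display_only: bool = False


# Only the Agentic weight (22%) is stated on the page; "Other" is a placeholder
# that lumps together every remaining category for this illustration.
CATEGORY_WEIGHTS = {"Agentic": 0.22, "Other": 0.78}


def overall_score(results):
    """Weighted sum of per-category averages, ignoring display-only rows."""
    by_category = {}
    for r in results:
        if r.display_only:
            continue  # shown on the page, but never enters the formula
        by_category.setdefault(r.category, []).append(r.score)
    return sum(
        CATEGORY_WEIGHTS[category] * (sum(scores) / len(scores))
        for category, scores in by_category.items()
    )


# Hypothetical model: two scored benchmarks plus the display-only GDPval-AA row.
scored = [
    BenchmarkResult("hypothetical-agentic-bench", "Agentic", 50.0),
    BenchmarkResult("hypothetical-other-bench", "Other", 80.0),
]
with_gdpval = scored + [
    BenchmarkResult("GDPval-AA", "Agentic", 69.5, display_only=True)
]

print(overall_score(scored), overall_score(with_gdpval))
```

Under these assumptions both calls return the same value, which is what "does not directly affect overall rankings" means in practice.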

About GDPval-AA

Year: 2026
Tasks: Economically valuable tasks
Format: Normalized score
Difficulty: Professional agentic workflows

OpenRouter's Grok 4.3 benchmark card displays GDPval-AA as a normalized percentage. BenchLM stores it separately from the Elo-style GDPval-AA rows used in provider comparison tables.

BenchLM freshness & provenance

Version: GDPval-AA 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (115 models)

Rank  Score
1     69.5%
2     63.5%
3     62.6%
4     58.9%
5     58.7%
6     57.8%
7     55.9%
8     54.9%
9     54.5%
10    53.6%
11    52.9%
12    52.7%
13    52.3%
14    51.8%
15    50.2%
16    50.2%
17    49.8%
18    49.8%
19    49.1%
20    49.1%
21    48.3%
22    47.6%
23    46.9%
24    45.9%
25    45.9%
26    45.7%
27    45.3%
28    45.1%
29    44.7%
30    44.4%
31    42.7%
32    41.5%
33    41.2%
34    41.0%
35    40.7%
36    39.9%
37    39.9%
38    39.7%
39    39.4%
40    39.2%
41    39.2%
42    36.9%
43    36.3%
44    36.0%
45    34.7%
46    34.6%
47    34.6%
48    34.5%
49    34.2%
50    34.2%
51    33.4%
52    33.0%
53    31.3%
54    30.8%
55    30.8%
56    30.7%
57    29.0%
58    28.0%
60    26.9%
61    25.7%
62    25.7%
63    25.0%
64    24.5%
65    24.2%
66    22.4%
67    21.4%
68    20.9%
69    20.9%
70    20.3%
71    18.8%
72    18.3%
73    18.3%
74    18.2%
75    18.0%
76    18.0%
77    16.3%
78    14.2%
79    14.1%
80    13.8%
82    13.1%
83    12.7%
84    12.0%
85    11.9%
86    11.8%
87    9.0%
88    7.5%
89    6.0%
90    5.7%
91    4.3%
92    3.0%
93    1.3%
94    0.0%
95    0.0%
96    0.0%
97    0.0%
98    0.0%
99    0.0%
100   0.0%
101   0.0%
102   0.0%
103   0.0%
104   0.0%
105   0.0%
106   0.0%
107   0.0%
108   0.0%
109   0.0%
110   0.0%
111   0.0%
112   0.0%
113   0.0%
114   0.0%
115   0.0%

FAQ

What does GDPval-AA measure?

A display-only Artificial Analysis normalized score for economically valuable tasks.

Which model scores highest on GDPval-AA?

Claude Opus 4.8 by Anthropic currently leads with a score of 69.5% on GDPval-AA.

How many models are evaluated on GDPval-AA?

115 AI models have been evaluated on GDPval-AA on BenchLM.

Last updated: June 2, 2026 · BenchLM version GDPval-AA 2026
