Benchmark profile

GDPval-AA

An agentic real-world work-task evaluation reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Data verified July 23, 2026

Benchmark score on GDPval-AA — July 23, 2026

BenchLM mirrors the published score view for GDPval-AA. Claude Fable 5 leads the public snapshot at 1748, followed by GPT-5.6 Sol (1736) and Kimi K3 (1679). BenchLM does not use these results to rank models overall.

1. Claude Fable 5 · Anthropic · Closed · claude-fable-5 · Score 1748 · Overall 83.68 · Context 1M+

2. GPT-5.6 Sol · OpenAI · Closed · gpt-5-6-sol · Score 1736 · Overall 81.96 · Context 1M

3. Kimi K3 · Moonshot AI · Closed · kimi-3 · Score 1679 · Overall 80.96 · Context 1.05M

82 models · Agentic · Current · Display only · Updated July 23, 2026

Benchmark score table (82 models)

Model · Provider · License · Score

Claude Fable 5 · Anthropic · Closed · 1748
GPT-5.6 Sol · OpenAI · Closed · 1736
Kimi K3 · Moonshot AI · Closed · 1679
Claude Sonnet 5 · Anthropic · Closed · 1607
Claude Opus 4.8 · Anthropic · Closed · 1594
GPT-5.6 Luna · OpenAI · Closed · 1584
GPT-5.6 Terra · OpenAI · Closed · 1581
Grok 4.5 · xAI · Closed · 1535
GLM-5.2 · Z.AI · Open weight · 1514
Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 1495
GPT-5.5 · OpenAI · Closed · 1490
Gemini 3.6 Flash · Google · Closed · 1421
GPT-5.4 · OpenAI · Closed · 1395
MiniMax M3 · MiniMax · Open weight · 1395
Muse Spark 1.1 · Meta · Closed · 1374
Gemini 3.5 Flash · Google · Closed · 1349
DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 1307
DeepSeek V4 Pro (High) · DeepSeek · Open weight · 1299
Qwen3.7 Max · Alibaba · Closed · 1273
MiMo-V2.5-Pro · Xiaomi · Closed · 1265
GLM-5.1 · Z.AI · Open weight · 1257
Inkling · Thinking Machines Lab · Open weight · 1239
Hy3 Preview · Tencent · Open weight · 1214
Hy3 · Tencent · Open weight · 1214
Kimi K2.6 · Moonshot AI · Open weight · 1189
DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 1189
Kimi K2.7 Code · Moonshot AI · Open weight · 1187
GPT-5.4 mini · OpenAI · Closed · 1171
GLM-4.7 · Z.AI · Open weight · 1165
Nemotron 3 Ultra · NVIDIA · Open weight · 1164
MiniMax M2.7 · MiniMax · Open weight · 1158
DeepSeek V4 Flash (High) · DeepSeek · Open weight · 1147
Muse Spark · Meta · Closed · 1144
Qwen3.6-27B · Alibaba · Open weight · 1140
Gemini 3.5 Flash-Lite · Google · Closed · 1140
Qwen3.6 Plus · Alibaba · Closed · 1135
GPT-5.4 nano · OpenAI · Closed · 1100
Grok 4.3 · xAI · Closed · 1085
GPT-5 (high) · OpenAI · Closed · 1075
Qwen3.6-35B-A3B · Alibaba · Open weight · 1049
Step 3.7 Flash · StepFun · Open weight · 1017
Kimi K2.5 · Moonshot AI · Open weight · 1009
Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 1009
GPT-5.1 · OpenAI · Closed · 987
Qwen3.5-122B-A10B · Alibaba · Open weight · 978
Gemini 3.1 Pro · Google · Closed · 965
Qwen3.5 397B · Alibaba · Open weight · 962
Qwen3.5 397B (Reasoning) · Alibaba · Open weight · 962
GPT-5 mini · OpenAI · Closed · 937
Qwen3.7 Plus · Alibaba · Closed · 936
Mistral Medium 3.5 128B · Mistral · Open weight · 929
MiMo-V2-Flash · Xiaomi · Open weight · 833
Gemma 4 31B · Google · Open weight · 804
GPT-OSS 120B · OpenAI · Open weight · 799
Gemma 4 26B A4B · Google · Open weight · 761
Command A+ · Cohere · Open weight · 714
Mercury 2 · Inception · Closed · 698
Nemotron 3 Super 120B A12B · NVIDIA · Open weight · 693
Gemini 2.5 Pro · Google · Closed · 665
Gemini 3.1 Flash-Lite · Google · Closed · 642
Mistral Large 3 · Mistral · Closed · 633
Mistral Small 4 · Mistral · Open weight · 588
Mistral Small 4 (Reasoning) · Mistral · Open weight · 588
GPT-OSS 20B · OpenAI · Open weight · 559
Trinity-Large-Preview · Arcee AI · Open weight · 554
Trinity-Large-Thinking · Arcee AI · Open weight · 554
Ling 2.6 Flash · InclusionAI · Open weight · 545
GPT-4.1 mini · OpenAI · Closed · 503
Nemotron 3 Nano 30B · NVIDIA · Open weight · 484
Ministral 3 14B (Reasoning) · Mistral · Open weight · 476
Ministral 3 14B · Mistral · Open weight · 476
Nemotron 3 Nano Omni 30B A3B · NVIDIA · Open weight · 467
Ministral 3 8B (Reasoning) · Mistral · Open weight · 449
Ministral 3 8B · Mistral · Open weight · 449
Ministral 3 3B (Reasoning) · Mistral · Open weight · 273
Ministral 3 3B · Mistral · Open weight · 273
GPT-4o mini · OpenAI · Closed · 226
DeepSeek V3 · DeepSeek · Open weight · 217
Llama 4 Scout · Meta · Open weight
GPT-4.1 nano · OpenAI · Closed
Llama 4 Maverick · Meta · Open weight · -16
Gemma 3 27B · Google · Open weight · -144

The published GDPval-AA snapshot places Claude Fable 5 first at 1748. Third-place Kimi K3 is 69 points behind at 1679, and the top 10 span 253 points (1748 down to 1495), so the benchmark still separates the leading systems clearly.
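The two gaps quoted above can be re-derived directly from the published scores. A minimal sketch in Python, with the top-10 values copied from the score table; the variable names are illustrative only.

```python
# Top-10 GDPval-AA scores, copied from the published table (highest first).
top10 = [1748, 1736, 1679, 1607, 1594, 1584, 1581, 1535, 1514, 1495]

# Gap between the leader and the third row.
gap_first_to_third = top10[0] - top10[2]

# Spread between the best and worst score in the top 10.
top10_range = max(top10) - min(top10)

print(gap_first_to_third)  # 69
print(top10_range)         # 253
```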

82 models have been evaluated on GDPval-AA. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. GDPval-AA is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
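To illustrate what "displayed but excluded" means in practice, here is a rough sketch of a category score that skips display-only rows. BenchLM's actual formula is not given on this page, so the simple averaging, the `BenchmarkResult` type, and the two example benchmarks are all assumptions made for illustration; only the 22% Agentic weight comes from the text above.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    name: str
    score: float        # scored benchmarks are assumed to be on a 0-100 scale
    display_only: bool  # True: shown on the page, left out of scoring

AGENTIC_WEIGHT = 0.22  # the 22% category weight stated above

def category_score(results):
    """Average the scored (non-display-only) benchmarks in one category."""
    scored = [r.score for r in results if not r.display_only]
    return sum(scored) / len(scored) if scored else None

def agentic_contribution(results):
    """Weighted contribution of the Agentic category to an overall score."""
    avg = category_score(results)
    return 0.0 if avg is None else AGENTIC_WEIGHT * avg

# Hypothetical benchmark names and scores, for illustration only.
without_gdpval = [
    BenchmarkResult("agentic-bench-a", 80.0, display_only=False),
    BenchmarkResult("agentic-bench-b", 70.0, display_only=False),
]
# GDPval-AA is an Elo value, not a 0-100 score, so it is flagged display-only.
with_gdpval = without_gdpval + [
    BenchmarkResult("GDPval-AA", 1748.0, display_only=True),
]

# The display-only row leaves the weighted contribution unchanged.
assert agentic_contribution(with_gdpval) == agentic_contribution(without_gdpval)
print(agentic_contribution(with_gdpval))
```

Filtering before averaging, rather than giving the row a zero weight afterwards, keeps an Elo-scale value from ever being mixed with percentage-scale scores.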

About GDPval-AA

Year: 2026

Tasks: Agentic real-world work tasks

Format: Elo

Difficulty: Professional agentic workflows

BenchLM stores GDPval-AA as a display-only provider-table row for DeepSeek-V4 because the source reports an Elo score rather than a 0-100 percentage.

DeepSeek-V4 Technical Report
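An Elo score is a relative rating rather than a fraction of tasks solved, which is why it does not map onto a 0-100 percentage. As a rough sketch, the standard logistic Elo model turns a rating difference into an expected head-to-head score. Whether GDPval-AA uses the conventional 400-point scale is an assumption here, and `elo_expected_score` is an illustrative helper, not part of any BenchLM or DeepSeek tooling.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    The 400-point scale is the conventional default and is assumed here;
    the benchmark's own scale is not stated on this page.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Leader (1748) against the second row (1736): a 12-point gap.
p_top_two = elo_expected_score(1748, 1736)
print(f"{p_top_two:.3f}")

# Leader against the third row (1679): a 69-point gap.
p_top_three = elo_expected_score(1748, 1679)
print(f"{p_top_three:.3f}")
```

Under these assumptions the 12-point gap at the top corresponds to an expected score of about 0.52, only slightly better than even, which shows that Elo differences say how often one model is preferred over another, not how much of the task set it completes.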

BenchLM freshness & provenance

Version: GDPval-AA 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Status: Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does GDPval-AA measure?

GDPval-AA is an agentic evaluation of real-world work tasks, reported as an Elo score in DeepSeek-V4 thinking-mode evaluations.

Which model scores highest on GDPval-AA?

Claude Fable 5 by Anthropic currently leads with a score of 1748 on GDPval-AA.

How many models are evaluated on GDPval-AA?

82 AI models have been evaluated on GDPval-AA on BenchLM.


Last updated: July 23, 2026 · BenchLM version GDPval-AA 2026
