Benchmark profile

Humanity's Last Exam with tools (HLE w/ tools)

Tool-augmented Humanity's Last Exam scores reported in DeepSeek-V4 thinking-mode evaluations.

Data verified July 23, 2026

Benchmark score on HLE w/ tools — July 23, 2026

BenchLM mirrors the published score view for HLE w/ tools. Claude Sonnet 5 leads the public snapshot at 57.4%, followed by Qwen3.7 Max (53.5%) and DeepSeek V4 Pro (Max) (48.2%). BenchLM does not use these results to rank models overall.

1. Claude Sonnet 5 (Anthropic, Closed) · claude-sonnet-5 · 57.4% · Overall 65.32 · Context 1M

2. Qwen3.7 Max (Alibaba, Closed) · qwen3-7-max · 53.5% · Overall 72.84 · Context 1M

3. DeepSeek V4 Pro (Max) (DeepSeek, Open) · deepseek-v4-pro-max · 48.2% · Overall not listed · Context 1M

9 models · Agentic · Current · Display only · Updated July 23, 2026

Benchmark score table (9 models)

Model · Provider · License · Score

Claude Sonnet 5 · Anthropic · Closed · 57.4%
Qwen3.7 Max · Alibaba · Closed · 53.5%
DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 48.2%
Agents-A1 · InternScience · Open weight · 47.6%
Step 3.7 Flash · StepFun · Open weight · 47.2%
DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 45.1%
DeepSeek V4 Pro (High) · DeepSeek · Open weight · 44.7%
DeepSeek V4 Flash (High) · DeepSeek · Open weight · 40.3%
Nemotron 3 Ultra · NVIDIA · Open weight · 37.4%

The published HLE w/ tools snapshot places Claude Sonnet 5 first at 57.4%. Third-place DeepSeek V4 Pro (Max) trails by 9.2 points at 48.2%. The full range across all 9 listed models is 20.0 points (57.4% down to 37.4%), so the table still separates the published systems.
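The gap figures quoted above can be checked directly against the score table. The following is a minimal sketch, not BenchLM's own tooling: the `scores` dictionary simply transcribes the nine published values, and `gap_summary` is a hypothetical helper name introduced here for illustration.

```python
# Scores transcribed from the HLE w/ tools table (percent, pass@1).
scores = {
    "Claude Sonnet 5": 57.4,
    "Qwen3.7 Max": 53.5,
    "DeepSeek V4 Pro (Max)": 48.2,
    "Agents-A1": 47.6,
    "Step 3.7 Flash": 47.2,
    "DeepSeek V4 Flash (Max)": 45.1,
    "DeepSeek V4 Pro (High)": 44.7,
    "DeepSeek V4 Flash (High)": 40.3,
    "Nemotron 3 Ultra": 37.4,
}


def gap_summary(scores):
    """Return point gaps between the leader and selected rows.

    Hypothetical helper: rounds to one decimal place to match the
    precision of the published scores.
    """
    ranked = sorted(scores.values(), reverse=True)
    return {
        "lead_over_second": round(ranked[0] - ranked[1], 1),
        "lead_over_third": round(ranked[0] - ranked[2], 1),
        "full_range": round(ranked[0] - ranked[-1], 1),
    }


summary = gap_summary(scores)
print(summary)
```

Rounding to one decimal place avoids floating-point noise in the subtraction and keeps the output at the same precision as the published percentages.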

9 models have been evaluated on HLE w/ tools. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. HLE w/ tools is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About HLE w/ tools

Year: 2026

Tasks: Expert questions with tool use

Format: Pass@1

Difficulty: Frontier tool-augmented reasoning

BenchLM stores HLE w/ tools as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations.

Source: DeepSeek-V4 Technical Report

BenchLM freshness & provenance

Version: HLE w/ tools 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does HLE w/ tools measure?

Tool-augmented Humanity's Last Exam scores reported in DeepSeek-V4 thinking-mode evaluations.

Which model scores highest on HLE w/ tools?

Claude Sonnet 5 by Anthropic currently leads with a score of 57.4% on HLE w/ tools.

How many models are evaluated on HLE w/ tools?

9 AI models have been evaluated on HLE w/ tools on BenchLM.

Compare Top Models on HLE w/ tools

Claude Sonnet 5 vs Qwen3.7 Max · Qwen3.7 Max vs DeepSeek V4 Pro (Max) · DeepSeek V4 Pro (Max) vs Agents-A1 · Agents-A1 vs Step 3.7 Flash

Last updated: July 23, 2026 · BenchLM version HLE w/ tools 2026
