Agentic benchmark report

Best LLMs for Agentic — July 2026 Leaderboard

As of July 2026, the top agentic model on the BenchLM leaderboard is GPT-5.6 Sol with a BenchAlign agentic score of 75.2.

Data refreshed: July 23, 2026

Tool use, browser research, and computer-use workflows

Decision lens: the score determines position; Supported and Estimated labels describe the evidence behind that position without removing sparsely reported models.

Ranked: 119 of 290 models
Supported / Estimated: 39 / 80
Weighted evidence: 3 of 27 benchmarks

27 tracked benchmarks

Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, OSWorld 2.0, CyberGym, Cybench, ExploitGym, JobBench, BrowseComp-VL, OSWorld, AndroidWorld, WebVoyager, MCP Atlas, Toolathlon, Finance Agent v2, GDPval-AA, ZClawBench, Tau2-Telecom, DeepSearchQA, Tau2-Airline, PinchBench, OpenHands Index, SWE-Atlas Refactoring, BFCL v4, MLE-Bench Lite, MM-ClawBench, Gert Labs

Scope: Terminal/tool use, Browser research, Computer use


Best Agentic picks

BenchLM's summary picks for agentic work, plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.


Best Agentic: GPT-5.6 Sol (OpenAI), 75.2 category score

Best open weight: MiniMax M3 (MiniMax), 69.75 overall score

Best Agentic value: GPT-5.6 Terra (OpenAI), $15 output / 1M tokens

Fastest measured: Grok 4.20 (xAI), 233 tokens / sec

Largest useful context: Grok 4.20 (xAI), 2M context window

Agentic AI Leaderboard

Primary score: BenchAlign agentic score. Higher values rank first.


Supported positions have diverse direct evidence. Estimated positions remain ranked but carry wider uncertainty.



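Rank, model, provider, evidence	Weights	Type	Context	Price (input / output per 1M tokens)	Speed (tokens/sec)	Latency	Agentic score	Agentic score (unrounded)	Terminal-Bench 2.0	BrowseComp	OSWorld-Verified	OSWorld 2.0	CyberGym	Cybench	ExploitGym	JobBench	BrowseComp-VL	OSWorld	AndroidWorld	WebVoyager	MCP Atlas	Toolathlon	Finance Agent v2	GDPval-AA	ZClawBench	Tau2-Telecom	DeepSearchQA	Tau2-Airline	PinchBench	OpenHands Index	SWE-Atlas Refactoring	BFCL v4	MLE-Bench Lite	MM-ClawBench	Gert Labs
1 GPT-5.6 Sol OpenAI Supported	Closed	Reasoning	1M	$5.00 / $30.00	Not listed	Not listed	75.2%	75.22	91.9%	92.2%	—	62.6%	84.5%	—	33.7%	—	—	—	—	—	—	58%	—	1736	—	85.1%	—	—	—	—	—	—	—	—	—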
2 GPT-5.6 Terra OpenAI Supported	Closed	Reasoning	1M	$2.50 / $15.00	Not listed	Not listed	73.1%	73.09	87.4%	87.5%	—	50.2%	81.8%	—	23.2%	—	—	—	—	—	—	53.1%	—	1581	—	86.3%	—	—	—	—	—	—	—	—	—
3 Muse Spark 1.1 Meta Supported	Closed	Reasoning	1M	Not listed	Not listed	Not listed	68.6%	68.61	80%	—	80.8%	14.2%	59.0%	92.9%	0.8%	54.7%	—	—	—	—	88.1%	75.6%	57.2%	1374	—	—	84.9%	—	—	—	—	—	—	—	—
4 Kimi K3 Moonshot AI Supported	Closed	Reasoning	1.05M	$3.00 / $15.00	Not listed	Not listed	66.6%	66.64	88.3%	91.2%	—	—	—	—	—	52.9%	—	—	—	—	84.2%	—	—	1679	—	—	95.0%	—	—	—	—	—	—	—	—
5 Claude Opus 4.7 (Adaptive) Anthropic Supported	Closed	Reasoning	1M	$5.00 / $25.00	Not listed	Not listed	65.9%	65.92	69.4%	79.3%	78%	18.2%	73.1%	—	—	45.9%	—	—	—	—	77.3%	—	—	1495	—	88.6%	—	—	—	—	—	—	—	—	—
6 GPT-5.5 OpenAI Supported	Closed	Reasoning	1M	$5.00 / $30.00	Not listed	Not listed	65.8%	65.78	82%	84.4%	78.7%	13.0%	81.8%	—	13.4%	42.7%	—	—	—	—	75.3%	55.6%	—	1490	—	93.9%	—	—	—	—	—	—	—	—	72.93%
7 Claude Mythos 5 Anthropic Supported	Closed	Reasoning	1M+	$10.00 / $50.00	Not listed	Not listed	65.0%	64.95	88%	88%	85%	—	—	—	17.5%	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
8 Claude Opus 4.8 Anthropic Supported	Closed	Reasoning	1M	$5.00 / $25.00	Not listed	Not listed	64.0%	64.02	74.6%	84.3%	83.4%	20.6%	—	—	—	—	—	—	—	—	82.2%	59.9%	53.9%	1594	—	94.4%	93.1%	—	—	—	—	—	—	—	72.97%
9 Claude Fable 5 Anthropic Supported	Closed	Reasoning	1M+	$10.00 / $50.00	Not listed	Not listed	63.9%	63.89	84.3%	—	85%	—	—	—	—	—	—	—	—	—	—	—	—	1748	—	98.5%	—	—	—	—	—	—	—	—	—
10 GPT-5.3 Codex OpenAI Estimated	Closed	Reasoning	400K	$1.75 / $14.00	79	88.26s	61.2%	61.24	77.3%	—	64.7%	—	—	—	—	33.7%	—	—	—	—	—	—	—	—	—	86%	—	—	—	—	—	—	—	—	57.47%
11 GPT-5.5 Pro OpenAI Estimated	Closed	Reasoning	1M	$30.00 / $180.00	Not listed	Not listed	60.5%	60.48	—	90.1%	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
12 Grok 4.5 xAI Supported	Closed	Reasoning	500K	$2.00 / $6.00	Not listed	Not listed	60%	60	83.3%	—	—	—	—	—	—	—	—	—	—	—	—	—	—	1535	—	—	—	—	—	—	—	—	—	—	—
13 Gemini 3 Pro Google Estimated	Closed	Standard	2M	$2.00 / $12.00	109	32.65s	59.7%	59.69	—	—	—	—	—	—	—	11.4%	—	—	—	—	—	—	—	—	—	87.1%	—	—	—	—	—	—	—	—	63.23%
14 Muse Spark Meta Estimated	Closed	Reasoning	262K	Not listed	Not listed	Not listed	59.4%	59.42	59%	—	—	—	43.5%	—	—	—	—	—	—	—	—	—	—	1144	—	91.5%	74.8%	—	—	—	—	—	—	—	—
15 Claude Sonnet 5 Anthropic Supported	Closed	Reasoning	1M	$2.00 / $10.00	Not listed	Not listed	58.7%	58.71	80.4%	84.7%	81.2%	—	—	—	—	—	—	—	—	—	—	—	—	1607	—	—	—	—	—	—	—	—	—	—	—
16 GPT-5.6 Luna OpenAI Supported	Closed	Reasoning	1M	$1.00 / $6.00	Not listed	Not listed	58.5%	58.52	84.7%	83.3%	—	45.6%	77.9%	—	12.4%	—	—	—	—	—	—	53.4%	—	1584	—	—	—	—	—	—	—	—	—	—	—
17 GPT-5.4 Pro OpenAI Estimated	Closed	Reasoning	1.05M	$30.00 / $180.00	74	151.79s	58.2%	58.19	—	89.3%	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—
18 Qwen 3.6 Max (preview) Alibaba Estimated	Closed	Reasoning	256K	Not listed	Not listed	Not listed	57.2%	57.23	65.4%	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	95.9%	—	—	—	—	—	—	—	—	—
19 GPT-5.4 OpenAI Supported	Closed	Reasoning	1.05M	$2.50 / $15.00	74	151.79s	57%	57	75.1%	82.7%	75%	—	79.0%	—	6.0%	38.9%	—	—	—	—	70.6%	54.6%	—	1395	—	87.1%	73.6%	—	—	—	—	—	—	—	64.89%
20 Claude Opus 4.6 (Adaptive) Anthropic Estimated	Closed	Reasoning	1M	Not listed	Not listed	Not listed	55.4%	55.41	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	92.1%	—	—	—	—	—	—	—	—	—
21 Claude Opus 4.6 Anthropic Supported	Closed	Standard	1M	$5.00 / $25.00	40	1.78s	55.2%	55.16	65.4%	83.7%	74%	—	66.6%	—	—	36.7%	—	—	—	—	—	—	—	—	—	84.8%	73.7%	—	—	—	—	—	—	—	61.85%
22 Claude Opus 4.7 Anthropic Supported	Closed	Standard	1M	$5.00 / $25.00	Not listed	Not listed	54.8%	54.84	—	—	—	13.9%	—	—	—	—	—	—	—	—	—	—	—	—	—	74%	—	—	—	—	—	—	—	—	65.59%
23 GLM-5 Z.AI Estimated Self-host	Open	Standard	200K	$1.00 / $3.20	74	1.64s	54.8%	54.83	56.2%	—	—	—	43.2%	—	—	—	—	—	—	—	31.1%	38%	—	—	—	98.2%	—	—	—	—	—	—	—	—	50.99%
24 GLM-5.2 Z.AI Supported Self-host	Open	Reasoning	1M	$1.40 / $4.40	Not listed	Not listed	54.6%	54.62	81%	—	—	—	—	—	—	—	—	—	—	—	76.8%	48.2%	—	1514	—	99.1%	—	—	—	—	—	—	—	—	—
25 SWE-1.7 Cognition Estimated	Closed	Reasoning	256K	Not listed	Not listed	Not listed	54.4%	54.38	81.5%	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—	—

Showing the top 25 of 119 ranked models.
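The rows above are tab-separated, with the rank, model, provider, and evidence label fused into the first cell. As a rough sketch of how one row could be read programmatically, the snippet below splits a row and maps its trailing 27 cells to benchmark names. The column order is an assumption: it is taken from the tracked-benchmark list at the top of the page, and the names `BENCHMARKS`, `Row`, `parse_value`, and `parse_row` are hypothetical helpers, not part of any BenchLM tooling.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed column order: the tracked-benchmark list from the top of the page.
BENCHMARKS = [
    "Terminal-Bench 2.0", "BrowseComp", "OSWorld-Verified", "OSWorld 2.0",
    "CyberGym", "Cybench", "ExploitGym", "JobBench", "BrowseComp-VL",
    "OSWorld", "AndroidWorld", "WebVoyager", "MCP Atlas", "Toolathlon",
    "Finance Agent v2", "GDPval-AA", "ZClawBench", "Tau2-Telecom",
    "DeepSearchQA", "Tau2-Airline", "PinchBench", "OpenHands Index",
    "SWE-Atlas Refactoring", "BFCL v4", "MLE-Bench Lite", "MM-ClawBench",
    "Gert Labs",
]


@dataclass
class Row:
    rank: int
    label: str          # fused "model provider evidence" text, kept as-is
    weights: str        # "Open" or "Closed"
    model_type: str     # "Reasoning" or "Standard"
    context: str
    price: str
    speed: Optional[float]
    latency: str
    score: float        # unrounded agentic score
    benchmarks: Dict[str, Optional[float]]


def parse_value(cell: str) -> Optional[float]:
    """Turn '91.9%', '~88.3', or '1736' into a float; blanks become None."""
    cell = cell.strip()
    if cell in ("", "—", "Not listed"):
        return None
    return float(cell.lstrip("~").rstrip("%"))


def parse_row(line: str) -> Row:
    cells = line.split("\t")
    rank_text, label = cells[0].split(" ", 1)
    values = cells[9:]
    if len(values) != len(BENCHMARKS):
        raise ValueError(f"expected {len(BENCHMARKS)} benchmark cells, got {len(values)}")
    return Row(
        rank=int(rank_text),
        label=label,
        weights=cells[1],
        model_type=cells[2],
        context=cells[3],
        price=cells[4],
        speed=parse_value(cells[5]),
        latency=cells[6],
        score=float(cells[8]),
        benchmarks={name: parse_value(v) for name, v in zip(BENCHMARKS, values)},
    )


# The first row of the table, rebuilt cell by cell.
SAMPLE = "\t".join(
    ["1 GPT-5.6 Sol OpenAI Supported", "Closed", "Reasoning", "1M",
     "$5.00 / $30.00", "Not listed", "Not listed", "75.2%", "75.22",
     "91.9%", "92.2%", "—", "62.6%", "84.5%", "—", "33.7%",
     "—", "—", "—", "—", "—", "—", "58%", "—", "1736", "—", "85.1%"]
    + ["—"] * 9
)
row = parse_row(SAMPLE)
```

Under this mapping, the 1736 value lands in the GDPval-AA column, which fits that benchmark being reported as an Elo score rather than a percentage.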


Top AI Models for Agentic — July 2026

As of July 2026, GPT-5.6 Sol leads the BenchAlign agentic leaderboard with a score of 75.2, followed by GPT-5.6 Terra (73.1) and Muse Spark 1.1 (68.6). BenchLM is currently showing 39 Supported and 80 Estimated models in this category.

Rank	Model	Score	Evidence and fit
1	GPT-5.6 Sol (OpenAI · Proprietary)	75.2	Supported. Ranks #1 on the current agentic board. Terminal-Bench 2.0: 91.9; BrowseComp: 92.2
2	GPT-5.6 Terra (OpenAI · Proprietary)	73.1	Supported. Ranks #2 on the current agentic board. Terminal-Bench 2.0: 87.4; BrowseComp: 87.5
3	Muse Spark 1.1 (Meta · Proprietary)	68.6	Supported. Ranks #3 on the current agentic board. Terminal-Bench 2.0: 80; OSWorld-Verified: 80.8

What changed

GPT-5.6 Sol ranks #1 at 75.2 with a Supported evidence label.

GPT-5.6 Terra ranks #2 at 73.1 with a Supported evidence label.

Muse Spark 1.1 ranks #3 at 68.6 with a Supported evidence label.

How to choose

Highest current agentic score? GPT-5.6 Sol: #1 at 75.2.

Prioritizing evidence maturity? GPT-5.6 Sol: highest Supported position.

Need open weights? GLM-5: highest-ranked open-weight model.
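The three questions above amount to filtering the board and taking the best remaining rank. A minimal sketch of that selection follows, using four rows copied from the leaderboard table; the `Entry` type and `pick` function are hypothetical names introduced for illustration, not a BenchLM API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    rank: int
    model: str
    open_weights: bool
    evidence: str  # "Supported" or "Estimated"
    score: float


# A hand-copied subset of the leaderboard rows, not the full board.
BOARD: List[Entry] = [
    Entry(1, "GPT-5.6 Sol", False, "Supported", 75.2),
    Entry(2, "GPT-5.6 Terra", False, "Supported", 73.1),
    Entry(23, "GLM-5", True, "Estimated", 54.8),
    Entry(24, "GLM-5.2", True, "Supported", 54.6),
]


def pick(board: List[Entry],
         need_open_weights: bool = False,
         supported_only: bool = False) -> Optional[Entry]:
    """Return the best-ranked entry that satisfies the given constraints."""
    candidates = [
        e for e in board
        if (e.open_weights or not need_open_weights)
        and (e.evidence == "Supported" or not supported_only)
    ]
    return min(candidates, key=lambda e: e.rank, default=None)


top_overall = pick(BOARD)
top_open = pick(BOARD, need_open_weights=True)
top_open_supported = pick(BOARD, need_open_weights=True, supported_only=True)
```

Combining the constraints matters: GLM-5 is the highest-ranked open-weight model, but it carries an Estimated label, so requiring both open weights and a Supported position selects GLM-5.2 instead.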

Top models by benchmark

Terminal-Bench 2.0: agentic software engineering and terminal task completion benchmark (38% of category score)

Rank	Model	Reported score
1	GPT-5.6 Sol	91.9
2	Kimi K3	~88.3
3	Claude Mythos 5	88
4	GPT-5.6 Terra	87.4
5	GPT-5.6 Luna	84.7

Score in Context

What these scores mean

BenchAlign places direct benchmarks and independent external signals on a common calibrated scale. The score is relative to the current evidence universe; it is not a raw percentage from any single test.

Known limitations

Estimated rows have less diverse direct evidence and wider uncertainty. They remain ranked so a newly released model is not treated as weak merely because fewer benchmark publishers have evaluated it.

How we weight

This lens combines category-relevant external evidence with admitted benchmark protocols. Evidence sources are calibrated for difficulty before aggregation, and no generated benchmark row contributes to the score.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.
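To make the fallback idea concrete, here is a rough sketch of a weighted average that skips missing benchmarks and renormalizes the remaining weights. It is not BenchAlign: it works on raw reported scores, applies no difficulty calibration, and uses no external signals, so it will not reproduce the published category scores. The renormalization step is an assumption about what "falls back to the remaining rows" could look like, and `WEIGHTS` and `weighted_category_score` are hypothetical names.

```python
from typing import Dict, Optional

# The three weighted benchmarks and their shares of the category score.
WEIGHTS: Dict[str, float] = {
    "Terminal-Bench 2.0": 0.38,
    "BrowseComp": 0.28,
    "OSWorld-Verified": 0.34,
}


def weighted_category_score(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float] = WEIGHTS,
) -> Optional[float]:
    """Weighted mean over the benchmarks that have a reported score.

    Missing benchmarks are dropped and the remaining weights are rescaled
    to sum to 1 (an assumed fallback, not BenchLM's documented rule).
    """
    available = {
        name: weight
        for name, weight in weights.items()
        if scores.get(name) is not None
    }
    if not available:
        return None
    total_weight = sum(available.values())
    return sum(scores[name] * w for name, w in available.items()) / total_weight


# Raw scores copied from the leaderboard table.
opus_47_adaptive = {
    "Terminal-Bench 2.0": 69.4,
    "BrowseComp": 79.3,
    "OSWorld-Verified": 78.0,
}
muse_spark_11 = {
    "Terminal-Bench 2.0": 80.0,
    "BrowseComp": None,  # not reported in the table
    "OSWorld-Verified": 80.8,
}

full_evidence = weighted_category_score(opus_47_adaptive)
partial_evidence = weighted_category_score(muse_spark_11)
```

Both results come out well above the published BenchAlign scores for the same models (65.9 and 68.6), which illustrates the point made above: the category score is a calibrated value, not a raw percentage from the underlying tests.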

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.


Agentic benchmark weights, ranking status, and descriptions
Benchmark	Weight	Status	Description
Terminal-Bench 2.0	38%	Weighted	Agentic software engineering and terminal task completion benchmark
BrowseComp	28%	Weighted	Web research benchmark for browsing agents
OSWorld-Verified	34%	Weighted	Computer-use benchmark for GUI task completion
OSWorld 2.0	—	Display only	A long-horizon computer-use benchmark covering realistic workflows across everyday and professional desktop tasks.
CyberGym	—	Display only	Cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.
Cybench	—	Display only	A cybersecurity benchmark of professional Capture the Flag tasks for measuring autonomous cyber agent capability and risk.
ExploitGym	—	Display only	A controlled benchmark for evaluating whether AI agents can extend vulnerability-triggering inputs into working exploits.
JobBench	—	Display only	An occupational agent benchmark for professional workflows that workers say they most want delegated to AI.
BrowseComp-VL	—	Display only	Vision-language browsing benchmark for multimodal web research and tool-use tasks.
OSWorld	—	Display only	Computer-use benchmark for GUI task completion across the broader OSWorld task suite.
AndroidWorld	—	Display only	Android GUI agent benchmark for task completion across mobile app workflows.
WebVoyager	—	Display only	Browser agent benchmark for completing multi-step workflows on live websites.
MCP Atlas	—	Display only	Tool-calling benchmark for Model Context Protocol integrations and multi-tool coordination
Toolathlon	—	Display only	General tool-calling benchmark for multi-step API and tool usage
Finance Agent v2	—	Display only	Financial analysis and decision-making benchmark for agentic expert tasks.
GDPval-AA	—	Display only	Real-world agentic knowledge-work evaluation reported as an Elo score.
ZClawBench	—	Display only	Z.AI's OpenClaw workflow benchmark for broad agent tasks across research, office work, data analysis, devops, automation, and security.
Tau2-Telecom	—	Display only	Telecom-focused tool-use benchmark for structured API workflows
DeepSearchQA	—	Display only	Agentic browsing benchmark for list-style question answering with browser tools.
Tau2-Airline	—	Display only	Airline-domain tool-use benchmark for structured workflow execution and API correctness.
PinchBench	—	Display only	An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.
OpenHands Index	—	Display only	A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.
SWE-Atlas Refactoring	—	Display only	A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.
BFCL v4	—	Display only	Function-calling benchmark for tool selection, schema adherence, and argument correctness.
MLE-Bench Lite	—	Display only	A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings.
MM-ClawBench	—	Display only	An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance.
Gert Labs	—	Display only	Composite game-environment leaderboard score across Gert Labs agentic coding, one-shot coding, and social decision-making modes.

About Agentic Benchmarks

Terminal-Bench 2.0, the highest-weighted benchmark in this category at 38%, is an agentic software engineering and terminal task completion benchmark.

Common questions

What is an agentic LLM benchmark?

Agentic benchmarks evaluate whether AI models can complete multi-step workflows using tools, browsers, terminals, or software interfaces instead of only answering in chat.

Which benchmarks matter for AI agents?

Key agentic benchmarks include Terminal-Bench 2.0 for terminal tasks, BrowseComp for web research, and OSWorld-Verified for computer-use workflows.

Why do agentic benchmarks matter in 2026?

Agentic benchmarks matter because many modern products rely on models that can browse, plan, use tools, and complete end-to-end tasks rather than only generate text.


