
Agentic Benchmarks — Terminal, Browsing & Computer Use Leaderboard

Tool use, browser research, and computer-use workflows

Bottom line: Claude Mythos Preview has a perfect agentic score, but GPT-5.4 is close behind and significantly cheaper for production workloads.

Terminal-Bench 2.0 · BrowseComp · OSWorld-Verified · CyberGym · BrowseComp-VL · OSWorld · AndroidWorld · WebVoyager · MCP Atlas · Toolathlon · Finance Agent v2 · GDPval-AA · ZClawBench · Tau2-Telecom · DeepSearchQA · Tau2-Airline · PinchBench · OpenHands Index · SWE-Atlas Refactoring · BFCL v4 · MLE-Bench Lite · MM-ClawBench · Gert Labs

Terminal/tool use · Browser research · Computer use

Best Agentic picks

BenchLM summaries for agentic plus the practical tradeoffs users check next: open weights, price, speed, latency, and context.

Top AI Models for Agentic (June 2026)

As of June 2026, Claude Mythos Preview leads the provisional agentic leaderboard with a weighted score of 100.0%, followed by GPT-5.5 (98.0%) and Claude Opus 4.8 (97.7%). BenchLM is currently showing 101 provisional-ranked models and 12 verified-ranked models in this category.

What changed

Claude Mythos Preview debuted at #1 with a 100.0 weighted agentic score — the first model to achieve this.

GPT-5.4 holds #2 at 93.5%, strong on both Terminal-Bench and BrowseComp.

Claude Opus 4.6 remains #3 at 92.6%, with the most consistent scores across all agentic sub-benchmarks.

How to choose

Top models by benchmark

Terminal-Bench 2.0: agentic software engineering and terminal task completion benchmark (28% of category score)

Agentic AI Leaderboard

Updated June 2, 2026

Sorted by agentic weighted score. Switch between provisional-ranked and verified-ranked modes to see the broader public dataset versus sourced-only ranking. Click column headers to re-sort by overall score or any benchmark.

101 ranked models
Provisional-ranked mode includes source-unverified non-generated benchmark evidence. P = provisional benchmark row.
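Rank | Model | Provider | Agentic score | Overall score | Benchmark values in source order (column labels not preserved)
1 | Claude Mythos Preview | - | 100% | 99 | 82%, 86.9%, 79.6%, 83.1%
2 | GPT-5.5 | OpenAI | 98% | 91 | 82%, 84.4%, 78.7%, 81.8%, 75.3%, 55.6%, 1769, 93.9%, 72.93%
3 | Claude Opus 4.8 | - | 97.7% | 95 | 74.6%, 84.3%, 83.4%, 82.2%, 59.9%, 53.9%, 1890, 94.4%, 93.1%, 72.97%
4 | (name not preserved) | - | 96.9% | 87 | 76.2%, 78.4%, 83.6%, 56.5%, 57.9%, 1656, 95.3%, 61.85%
5 | (name not preserved) | - | 95.3% | Est. 90 | 1324
6 | (name not preserved) | - | 92% | 91 | 89.3%
7 | (name not preserved) | - | 90.1% | Est. 83 | -
8 | (name not preserved) | - | 88.7% | 85 | 69.4%, 79.3%, 78%, 73.1%, 77.3%, 1753, 88.6%
9 | GPT-5.4 | OpenAI | 87.4% | 89 | 75.1%, 82.7%, 75%, 79.0%, 70.6%, 54.6%, 1674, 87.1%, 73.6%, 64.89%
10 | (name not preserved) | - | 87.1% | 84 | 66.7%, 83.2%, 73.1%, 55.9%, 50%, 1481, 95.9%, 92.5%, 56.82%
11 | (name not preserved) | - | 84.4% | 87 | 65.4%, 83.7%, 74%, 66.6%, 1591, 84.8%, 73.7%, 61.85%
12 | (name not preserved) | - | 83.3% | 92 | 1314, 95.6%, 69.7%, 56.87%
13 | (name not preserved) | - | 82.4% | 76 | 66%, 83.5%, 70.1%, 74.2%
14 | (name not preserved) | - | 80.8% | 83 | 59.1%, 72.1%, 65.2%, 1599, 79.5%, 62.92%
15 | (name not preserved) | - | 79.2% | Est. 86 | 77.3%, 64.7%, 1482, 86%, 57.47%
16 | GLM-5 (Reasoning) | Z.AI (Self-host) | 79.2% | Est. 80 | -
17 | (name not preserved) | - | 78.9% | Est. 90 | -
18 | (name not preserved) | - | 76.7% | Est. 77 | 1294, 84.8%
19 | GPT-5.1 | OpenAI | 75.8% | Est. 78 | 1227, 81.9%, 41.24%
20 | (name not preserved) | - | 74.8% | 76 | 59.3%, 66.3%, 50.6%, 66.3%, 42.3%, 43.5%, 1419, 86.3%, 64.23%
21 | (name not preserved) | - | 72.1% | 81 | 1184, 87.1%, 63.23%
22 | Qwen3.5 397B (Reasoning) | Alibaba (Self-host) | 71% | Est. 78 | 1190, 95.6%
23 | (name not preserved) | - | 69.5% | 73 | 61.6%, 48.2%, 39.8%, 1354, 97.7%, 50.60%
24 | (name not preserved) | - | 69.5% | Est. 70 | 1000, 86.5%
25 | (name not preserved) | - | 63.5% | 76 | 50.8%, 60.6%, 1285, 95.9%, 32.58%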
Showing 25 of 101
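
The two ranking modes described above can be pictured as a filter followed by a sort. The sketch below is illustrative only: the `ModelRow` fields, the `rank_models` function, and the sample model names are hypothetical, and it assumes verified-ranked mode simply drops rows flagged as provisional. The real site may instead recompute scores from sourced-only evidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelRow:
    model: str            # hypothetical model label
    agentic_score: float  # weighted agentic score, in percent
    provisional: bool     # True for rows marked "P" (source-unverified evidence)


def rank_models(rows: List[ModelRow], mode: str = "provisional") -> List[ModelRow]:
    """Return rows sorted by agentic score, highest first.

    Assumption: "verified" mode excludes provisional rows outright;
    "provisional" mode keeps every row.
    """
    if mode not in ("provisional", "verified"):
        raise ValueError(f"unknown mode: {mode!r}")
    pool = rows if mode == "provisional" else [r for r in rows if not r.provisional]
    return sorted(pool, key=lambda r: r.agentic_score, reverse=True)


# Made-up rows, not taken from the leaderboard.
sample = [
    ModelRow("model-a", 71.0, provisional=True),
    ModelRow("model-b", 88.5, provisional=False),
    ModelRow("model-c", 93.2, provisional=True),
]
provisional_ranking = [r.model for r in rank_models(sample, "provisional")]
verified_ranking = [r.model for r in rank_models(sample, "verified")]
```

Under this assumption the verified list is always a subset of the provisional list, which would be consistent with the page reporting far fewer verified-ranked models than provisional-ranked ones.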

These rankings update weekly


Score in Context

What these scores mean

Agentic carries the highest weight in BenchLM.ai's overall scoring, at 22%, reflecting that browse-and-do workflows now matter more than raw chat fluency. The weighted score blends Terminal-Bench 2.0 (28%), BrowseComp (18%), and OSWorld-Verified (24%). In practice, a 5-point gap can mean the difference between an agent that reliably completes multi-step tasks and one that stalls midway.

Known limitations

Agentic benchmarks are newer and less standardized than coding or knowledge tests. Terminal-Bench and BrowseComp use different evaluation harnesses, so cross-benchmark comparison requires care. Some models lack agentic benchmark data entirely and are excluded from rankings rather than estimated.

How we weight

Agentic capability carries a 22% weight in BenchLM.ai's overall scoring — the single biggest contributor, reflecting that browse-and-do workflows now matter more than raw chat fluency.
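
As a rough illustration of what a 22% weight implies, assume (the formula is not spelled out on this page) that the overall score is a linear weighted sum of category scores. Under that assumption, a gap in agentic score moves the overall score by:

```latex
\Delta_{\text{overall}} = 0.22 \times \Delta_{\text{agentic}},
\qquad \text{e.g. } 0.22 \times 5 = 1.1 \text{ points}
```

The actual methodology may normalize or rescale category scores, so treat this as an order-of-magnitude guide only.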

Agentic benchmarks test whether an AI model can do work, not just talk about it — opening tools, gathering evidence, navigating software, and staying coherent over long action chains. See the agentic leaderboard or compare with coding benchmarks.

Leaderboards exclude benchmark rows that BenchLM generated from other scores or cloned from reference models. When a weighted benchmark is missing after that filter, the category falls back to the remaining trustworthy public rows instead of filling the gap with synthetic values.

The full scoring rules, freshness handling, and runtime/pricing caveats live on the BenchLM methodology page.

Benchmark | Weight | Status | Description
Terminal-Bench 2.0 | 28% | Weighted | Agentic software engineering and terminal task completion benchmark
BrowseComp | 18% | Weighted | Web research benchmark for browsing agents
OSWorld-Verified | 24% | Weighted | Computer-use benchmark for GUI task completion
CyberGym | - | Display only | Cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.
BrowseComp-VL | - | Display only | Vision-language browsing benchmark for multimodal web research and tool-use tasks.
OSWorld | - | Display only | Computer-use benchmark for GUI task completion across the broader OSWorld task suite.
AndroidWorld | - | Display only | Android GUI agent benchmark for task completion across mobile app workflows.
WebVoyager | - | Display only | Browser agent benchmark for completing multi-step workflows on live websites.
MCP Atlas | - | Display only | Tool-calling benchmark for Model Context Protocol integrations and multi-tool coordination
Toolathlon | - | Display only | General tool-calling benchmark for multi-step API and tool usage
Finance Agent v2 | - | Display only | Financial analysis and decision-making benchmark for agentic expert tasks.
GDPval-AA | - | Display only | Real-world agentic knowledge-work evaluation reported as an Elo score.
ZClawBench | - | Display only | Z.AI's OpenClaw workflow benchmark for broad agent tasks across research, office work, data analysis, devops, automation, and security.
Tau2-Telecom | - | Display only | Telecom-focused tool-use benchmark for structured API workflows
DeepSearchQA | - | Display only | Agentic browsing benchmark for list-style question answering with browser tools.
Tau2-Airline | - | Display only | Airline-domain tool-use benchmark for structured workflow execution and API correctness.
PinchBench | - | Display only | An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.
OpenHands Index | - | Display only | A holistic coding-agent benchmark that evaluates AI agents across issue resolution, frontend work, greenfield development, testing, and information gathering.
SWE-Atlas Refactoring | - | Display only | A Scale SWE-Atlas software-engineering agent benchmark focused on refactoring tasks.
BFCL v4 | - | Display only | Function-calling benchmark for tool selection, schema adherence, and argument correctness.
MLE-Bench Lite | - | Display only | A lightweight machine-learning competition benchmark that measures whether models can iteratively train, evaluate, and improve ML systems in low-resource settings.
MM-ClawBench | - | Display only | An OpenClaw-derived agent benchmark covering practical work and life tasks such as office document delivery, research, planning, and code maintenance.
Gert Labs | - | Display only | Composite game-environment leaderboard score across Gert Labs agentic coding, one-shot coding, and social decision-making modes.
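
One way to picture how the three weighted benchmarks combine, including the fallback when a weighted benchmark is missing, is a weighted mean over whichever benchmarks have usable scores. This is a minimal sketch under stated assumptions: the weights come from the table above, but the renormalization step, the function name `weighted_agentic_score`, and the sample scores are hypothetical. It does not attempt to reproduce the published leaderboard percentages, which may be scaled differently.

```python
from typing import Mapping, Optional

# Weights as listed in the benchmark table on this page.
AGENTIC_WEIGHTS = {
    "Terminal-Bench 2.0": 0.28,
    "BrowseComp": 0.18,
    "OSWorld-Verified": 0.24,
}


def weighted_agentic_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = AGENTIC_WEIGHTS,
) -> Optional[float]:
    """Weighted mean over the weighted benchmarks that have a score.

    Assumption: a missing benchmark (absent or None) is dropped and the
    remaining weights are renormalized. No synthetic value is filled in.
    Returns None when no weighted benchmark has a score.
    """
    available = {
        name: w for name, w in weights.items() if scores.get(name) is not None
    }
    total_weight = sum(available.values())
    if total_weight == 0:
        return None
    return sum(scores[name] * w for name, w in available.items()) / total_weight


# Hypothetical scores, in percent.
full = weighted_agentic_score(
    {"Terminal-Bench 2.0": 80.0, "BrowseComp": 70.0, "OSWorld-Verified": 60.0}
)
partial = weighted_agentic_score(
    {"Terminal-Bench 2.0": 80.0, "BrowseComp": None, "OSWorld-Verified": 60.0}
)
empty = weighted_agentic_score({})
```

Display-only benchmarks are ignored here because they carry no weight in the table. Returning `None` for a model with no weighted data mirrors the stated policy of excluding such models rather than estimating a score.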


About Agentic Benchmarks

Agentic software engineering and terminal task completion benchmark
