Benchmark profile

OSWorld-Verified

OSWorld-Verified is the July 2025 repaired release of OSWorld's real-computer evaluation. It measures whether a model-agent system can finish desktop and web tasks from configured starting states, with success checked by execution-based evaluators.

Data verified July 23, 2026

How to read this leaderboard

Editorial review by Glevd · 2026-07-15

Read this table as a ledger of sourced model-agent results. Before comparing scores, match the task count, environment revision, model and agent variant, system category (general model, specialized model, or agentic framework), observation type (screenshot or accessibility tree), action interface, prompt and scaffold, action budget, and attempt policy. The benchmark-hosted table uses unified settings; provider-reported rows are not automatically part of that controlled comparison.
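The matching rule above can be sketched as a field-by-field check. This is a minimal illustration, not the OSWorld maintainers' or BenchLM's implementation: the `RunSetup` record, its field names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunSetup:
    """Hypothetical record of the conditions behind one leaderboard row."""
    task_count: int
    environment_revision: str
    model_agent_variant: str
    system_category: str   # "general", "specialized", or "agentic framework"
    observation: str       # "screenshot" or "accessibility tree"
    action_interface: str
    prompt_scaffold: str
    action_budget: int
    attempt_policy: str


def setup_mismatches(a, b, ignore=()):
    """Return the names of the setup fields on which two runs differ."""
    return [
        f.name
        for f in fields(a)
        if f.name not in ignore and getattr(a, f.name) != getattr(b, f.name)
    ]


# Illustrative values only; none of these come from a real leaderboard row.
row_a = RunSetup(361, "rev-a", "model-x + agent-1", "general", "screenshot",
                 "mouse-keyboard", "default", 100, "single attempt")
row_b = replace(row_a, task_count=369, action_budget=50)

print(setup_mismatches(row_a, row_b))  # ['task_count', 'action_budget']
```

An empty list means the two rows share every recorded condition; any non-empty result names what to reconcile before treating a score gap as meaningful. Pass `ignore=("model_agent_variant",)` when the point of the comparison is two different models under otherwise identical conditions.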

Operator receipt: 27 sourced rows are currently displayable on this page; the leading published row is Claude Mythos 5 at 85%.

Honest limit: A success rate reflects the complete model-agent system, not the base model alone. Fixed applications and configured starting states do not reproduce every production login, permission boundary, network failure, application update, latency, cost, or safety control. Eight Google Drive tasks may be excluded under the official policy, and rows without displayable exact-source records remain withheld. OSWorld 2.0 results are not interchangeable with OSWorld-Verified results.


OSWorld-Verified project and results · Original OSWorld paper · Official OSWorld implementation · OSWorld 2.0 project

Top models on OSWorld-Verified — July 23, 2026

As of July 23, 2026, Claude Mythos 5 is listed first on the OSWorld-Verified leaderboard at 85%, level with Claude Fable 5 (85%) at the displayed precision and ahead of Claude Opus 4.8 (83.4%).

1. Claude Mythos 5 · Anthropic · Closed · claude-mythos-5 · 85% · Overall 83.93 · Context 1M+
2. Claude Fable 5 · Anthropic · Closed · claude-fable-5 · 85% · Overall 83.68 · Context 1M+
3. Claude Opus 4.8 · Anthropic · Closed · claude-opus-4-8 · 83.4% · Overall 78.34 · Context 1M

27 models · Agentic · 34% of category score · Current · Updated July 23, 2026

Leaderboard (27 models)

Model · Provider · Weights · Score
Claude Mythos 5 · Anthropic · Closed · 85%
Claude Fable 5 · Anthropic · Closed · 85%
Claude Opus 4.8 · Anthropic · Closed · 83.4%
Gemini 3.6 Flash · Google · Closed · 83%
Holo3-35B-A3B · H Company · Open weight · 82.6%
Claude Sonnet 5 · Anthropic · Closed · 81.2%
Muse Spark 1.1 · Meta · Closed · 80.8%
Holo3-122B-A10B · H Company · Closed · 78.8%
GPT-5.5 · OpenAI · Closed · 78.7%
Gemini 3.5 Flash · Google · Closed · 78.4%
Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 78%
GPT-5.4 · OpenAI · Closed · 75%
Gemini 3.5 Flash-Lite · Google · Closed · 74%
Qwen3.7 Plus · Alibaba · Closed · 73.3%
Kimi K2.6 · Moonshot AI · Open weight · 73.1%
Claude Opus 4.6 · Anthropic · Closed · 72.7%
Claude Sonnet 4.6 · Anthropic · Closed · 72.1%
GPT-5.4 mini · OpenAI · Closed · 72.1%
MiniMax M3 · MiniMax · Open weight · 70.1%
Claude Opus 4.5 · Anthropic · Closed · 66.3%
GPT-5.3 Codex · OpenAI · Closed · 64.7%
Claude Sonnet 4.5 · Anthropic · Closed · 61.4%
Qwen3.5-122B-A10B · Alibaba · Open weight · 58%
Qwen3.5-27B · Alibaba · Open weight · 56.2%
Qwen3.5-35B-A3B · Alibaba · Open weight · 54.5%
GPT-5.2 · OpenAI · Closed · 47.3%
GPT-5.4 nano · OpenAI · Closed · 39%

According to BenchLM.ai, Claude Mythos 5 tops the OSWorld-Verified benchmark at 85%, level with Claude Fable 5 (85%) at the displayed precision and ahead of Claude Opus 4.8 (83.4%). The top three rows sit within 1.6 points of each other, which suggests the benchmark may be nearing saturation for frontier models.
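The clustering figure is a simple spread between the best and the k-th best score. A small sketch, using the five highest scores shown on this page; the helper name `top_k_spread` is hypothetical:

```python
# Five highest OSWorld-Verified scores displayed on this page, in percent.
top_scores = [85.0, 85.0, 83.4, 83.0, 82.6]


def top_k_spread(scores, k):
    """Gap in points between the best and the k-th best score."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scores")
    top = sorted(scores, reverse=True)[:k]
    # Round to the one-decimal precision used on the page.
    return round(top[0] - top[-1], 1)


print(top_k_spread(top_scores, 3))  # 1.6
print(top_k_spread(top_scores, 5))  # 2.4
```

A narrow spread says the leading rows are close at the displayed precision; it does not by itself show that those rows were run under matching conditions.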

27 models have been evaluated on OSWorld-Verified. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Within that category, OSWorld-Verified contributes 34% of the category score, so strong performance here directly affects a model's overall ranking.
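As a rough illustration of what those two weights imply together, the sketch below multiplies them. It assumes the overall score is a plain weighted average on the same 0 to 100 scale, which this page does not confirm; the function names are hypothetical.

```python
CATEGORY_WEIGHT = 0.22  # Agentic category share of the overall score (from the text)
BENCHMARK_SHARE = 0.34  # OSWorld-Verified share of the Agentic category (from the text)


def effective_weight(category_weight=CATEGORY_WEIGHT, benchmark_share=BENCHMARK_SHARE):
    """Share of the overall score driven by one benchmark (linear-weighting assumption)."""
    return category_weight * benchmark_share


def overall_point_shift(benchmark_delta):
    """Overall-score change for a given benchmark change (linear-weighting assumption)."""
    return benchmark_delta * effective_weight()


print(f"{effective_weight():.2%}")        # 7.48%
print(f"{overall_point_shift(10):.3f}")   # 0.748
```

Under that assumption, OSWorld-Verified accounts for roughly 7.5% of the overall score, so a 10-point gap on this benchmark moves the overall score by about three quarters of a point.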

About OSWorld-Verified

Year: 2025
Tasks: 369 real-world computer tasks (361 when eight Google Drive tasks are excluded)
Format: Execution-based interactive task success
Difficulty: Multi-step desktop and cross-application workflows

The release covers 369 real-world tasks across desktop and web applications. The maintainers allow eight Google Drive tasks to be manually configured or excluded, making a 361-task run officially acceptable. Public evaluation requires the maintainers to run the agent or review monitoring data and trajectories. OSWorld 2.0 is a newer, separate protocol.
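The choice between the 369-task and 361-task denominators changes the reported percentage. The sketch below shows the arithmetic on synthetic data: the task IDs, the pass/fail pattern, and the set standing in for the eight Google Drive tasks are all made up for illustration, and treating unconfigured Drive tasks as failures in the full run is an assumption.

```python
def success_rate(results, excluded=frozenset()):
    """Percent of scored tasks that passed; `results` maps task ID to True/False."""
    scored = [passed for task_id, passed in results.items() if task_id not in excluded]
    if not scored:
        raise ValueError("no tasks left to score")
    return 100.0 * sum(scored) / len(scored)


# Synthetic run: 369 placeholder tasks, the first 300 pass.
task_ids = [f"task-{i:03d}" for i in range(369)]
results = {task_id: i < 300 for i, task_id in enumerate(task_ids)}

# Hypothetical stand-in for the eight Google Drive tasks (all failing here).
drive_tasks = frozenset(task_ids[-8:])

full = success_rate(results)
reduced = success_rate(results, excluded=drive_tasks)

print(f"{full:.1f}% over 369 tasks")     # 81.3% over 369 tasks
print(f"{reduced:.1f}% over 361 tasks")  # 83.1% over 361 tasks
```

In this synthetic case the same run reads almost two points higher on the 361-task denominator, which is why the task count belongs next to any score being compared.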


BenchLM freshness & provenance

Version: OSWorld Verified
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does OSWorld-Verified measure?

OSWorld-Verified measures whether a computer-use agent can complete 369 tasks across real desktop and web applications. Each task starts from a configured state and is checked with an execution-based evaluator. Eight Google Drive tasks may be manually configured or excluded, producing an officially accepted 361-task run.

Are OSWorld-Verified scores directly comparable?

Only when the setup matches. Compare the same task set, environment revision, model-agent variant, observation and action interface, prompt or scaffold, action budget, and attempt policy. The official leaderboard separates general models, specialized models, and agentic frameworks; provider-published scores may use different conditions.

Can OSWorld-Verified choose the best computer-use model?

Not alone. It is strong evidence for desktop task completion, but fixed applications cannot reproduce every production login, permission, network, app-version, latency, cost, or safety condition. Use matched OSWorld-Verified results with workflow trials, and treat OSWorld 2.0 as a separate newer protocol rather than interchangeable evidence.

Last updated: July 23, 2026 · BenchLM version OSWorld Verified
