Benchmark profile

SWE-bench Pro

A long-horizon repository benchmark built to test realistic software engineering work. Its scores need a task-quality and setup check before they support a coding-agent decision.

Data verified July 23, 2026

Claude Mythos 5 leads the SWE-bench Pro leaderboard on BenchLM's July 2026 update with 80.3%, ahead of Claude Fable 5 (80%) and Sakana Fugu-Ultra (73.7%), across 53 tracked models.

How to read this leaderboard

Editorial review by Glevd · 2026-07-15

Use SWE-bench Pro as evidence for long-horizon repository work only after matching the split, scaffold, tool budget, retry policy, and run count. The table preserves exact published rows from different providers, so a small score gap is directional unless the setups match.
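One way to apply that rule is to compare the setup fields before comparing the scores. The sketch below is illustrative only: `RunSetup` and `mismatched_fields` are hypothetical names, not part of BenchLM or the SWE-bench Pro tooling, the unit chosen for the tool budget is an assumption, and the two example setups are made up.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunSetup:
    """Hypothetical record of the setup behind one published score."""
    split: str         # "public", "held-out", or "commercial"
    scaffold: str      # agent harness used for the run
    tool_budget: int   # assumed unit: maximum tool calls per task
    retry_policy: str  # for example "single attempt"
    run_count: int     # number of runs behind the reported score


def mismatched_fields(a: RunSetup, b: RunSetup) -> list[str]:
    """Return the names of the setup fields on which two runs differ."""
    return [f.name for f in fields(RunSetup)
            if getattr(a, f.name) != getattr(b, f.name)]


# Made-up setups for illustration; they do not describe any real leaderboard row.
row_a = RunSetup("public", "scaffold-x", 200, "single attempt", 1)
row_b = RunSetup("public", "scaffold-y", 200, "single attempt", 3)

differences = mismatched_fields(row_a, row_b)
print(differences or "setups match; the score gap can be read directly")
```

If the list is non-empty, treat the gap between the two rows as directional rather than as a measured difference.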

Operator receipt: 53 sourced rows are currently displayable on this page; the leading published row is Claude Mythos 5 at 80.3%.

Honest limit: OpenAI's July 2026 audit estimated that about 30% of the 731-task public split is broken and retracted its earlier recommendation to adopt the benchmark. The current rows remain useful published receipts, but this page should not decide a coding-agent purchase by itself.
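To put the audit figure in task counts, the arithmetic can be sketched as follows. The 30% value is the audit's estimate as quoted above, not an exact tally, so the results are approximate.

```python
PUBLIC_SPLIT_TASKS = 731   # size of the public split, from the text above
BROKEN_FRACTION = 0.30     # "about 30%": an estimate, not an exact count

# Round to whole tasks; the true number depends on the audit's exact figure.
estimated_broken = round(PUBLIC_SPLIT_TASKS * BROKEN_FRACTION)
estimated_sound = PUBLIC_SPLIT_TASKS - estimated_broken

print(f"roughly {estimated_broken} broken tasks, {estimated_sound} sound tasks")
```

Under that estimate, roughly 219 of the 731 public tasks are affected and about 512 are not.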

Sources: SWE-Bench Pro paper · Official code and reproduction guide · July 2026 task-quality audit

Top models on SWE-bench Pro — July 23, 2026

As of July 23, 2026, Claude Mythos 5 leads the SWE-bench Pro leaderboard with 80.3%, followed by Claude Fable 5 (80%) and Sakana Fugu-Ultra (73.7%).

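Rank · Model · Provider · License · Model ID · Score · Overall · Context

1 · Claude Mythos 5 · Anthropic · Closed · claude-mythos-5 · 80.3% · Overall 83.93 · Context 1M+
2 · Claude Fable 5 · Anthropic · Closed · claude-fable-5 · 80% · Overall 83.68 · Context 1M+
3 · Sakana Fugu-Ultra · Sakana AI · Closed · sakana-fugu-ultra · 73.7% · Overall not listed · Context 1M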

53 models · Coding · 10% of category score · Current · Updated July 23, 2026

Leaderboard (53 models)

Model · Provider · License · Score

Claude Mythos 5 · Anthropic · Closed · 80.3%
Claude Fable 5 · Anthropic · Closed · 80%
Sakana Fugu-Ultra · Sakana AI · Closed · 73.7%
Claude Opus 4.8 · Anthropic · Closed · 69.2%
Grok 4.5 · xAI · Closed · 64.7%
GPT-5.6 Sol · OpenAI · Closed · 64.6%
Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 64.3%
GPT-5.6 Terra · OpenAI · Closed · 63.4%
Claude Sonnet 5 · Anthropic · Closed · 63.2%
GPT-5.6 Luna · OpenAI · Closed · 62.7%
Ornith-1.0-397B · DeepReinforce AI · Open weight · 62.2%
GLM-5.2 · Z.AI · Open weight · 62.1%
Muse Spark 1.1 · Meta · Closed · 61.5%
Qwen3.7 Max · Alibaba · Closed · 60.6%
Laguna S 2.1 · Poolside · Open weight · 59.4%
MiniMax M3 · MiniMax · Open weight · 59%
Sakana Fugu · Sakana AI · Closed · 59%
GPT-5.5 · OpenAI · Closed · 58.6%
Kimi K2.6 · Moonshot AI · Open weight · 58.6%
GLM-5.1 · Z.AI · Open weight · 58.4%
GPT-5.4 · OpenAI · Closed · 57.7%
Qwen3.7 Plus · Alibaba · Closed · 57.6%
Qwen 3.6 Max (preview) · Alibaba · Closed · 57.3%
MiMo-V2.5-Pro · Xiaomi · Closed · 57.2%
Claude Opus 4.5 · Anthropic · Closed · 57.1%
GPT-5.3 Codex · OpenAI · Closed · 56.8%
Qwen3.6 Plus · Alibaba · Closed · 56.6%
Step 3.7 Flash · StepFun · Open weight · 56.3%
MiniMax M2.7 · MiniMax · Open weight · 56.2%
MiMo-V2.5 · Xiaomi · Closed · 56.1%
GPT-5.2 · OpenAI · Closed · 55.6%
DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 55.4%
Gemini 3.5 Flash · Google · Closed · 55.1%
GLM-5 · Z.AI · Open weight · 55.1%
DeepSeek V4 Pro (High) · DeepSeek · Open weight · 54.4%
Inkling · Thinking Machines Lab · Open weight · 54.3%
Gemini 3.5 Flash-Lite · Google · Closed · 54.2%
Qwen3.6-27B · Alibaba · Open weight · 53.5%
Claude Opus 4.6 · Anthropic · Closed · 53.4%
MAI-Thinking-1 · Microsoft · Closed · 52.8%
DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 52.6%
Muse Spark · Meta · Closed · 52.4%
DeepSeek V4 Flash (High) · DeepSeek · Open weight · 52.3%
DeepSeek V4 Pro · DeepSeek · Open weight · 52.1%
Grok 4.20 · xAI · Closed · 51.8%
Qwen3.5 397B · Alibaba · Open weight · 50.9%
Kimi K2.5 · Moonshot AI · Open weight · 50.7%
Ornith-1.0-35B · DeepReinforce AI · Open weight · 50.4%
Qwen3.6-35B-A3B · Alibaba · Open weight · 49.5%
Laguna M.1 · Poolside · Closed · 49.2%
DeepSeek V4 Flash · DeepSeek · Open weight · 49.1%
Laguna XS.2 · Poolside · Open weight · 46.3%
Ornith-1.0-9B · DeepReinforce AI · Open weight · 42.9%

According to BenchLM.ai, Claude Mythos 5 leads the SWE-bench Pro benchmark with a score of 80.3%, followed by Claude Fable 5 (80%) and Sakana Fugu-Ultra (73.7%). Scores range from 80.3% down to 42.9%: the top two sit 0.3 points apart, third place trails second by 6.3 points, and 44 of the 53 rows fall between 50% and 65%.
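A rough sampling-noise check shows why a gap as small as the one between the top two rows should be read cautiously even when setups match. This is a sketch under stated assumptions, not a description of how any provider scored its run: it assumes both scores are single-run pass rates over the 731-task public split and that tasks are independent, neither of which this page confirms.

```python
import math


def standard_error(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured over n_tasks tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


N_TASKS = 731              # assumption: both runs used the public split
top, second = 0.803, 0.80  # the two leading published scores

gap = top - second
# Standard error of the difference between two independent pass rates.
se_gap = math.sqrt(standard_error(top, N_TASKS) ** 2
                   + standard_error(second, N_TASKS) ** 2)

print(f"gap = {gap * 100:.1f} points, standard error of gap = {se_gap * 100:.1f} points")
```

Under these assumptions the standard error of the gap is about 2 points, several times larger than the 0.3-point difference itself.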

53 models have been evaluated on SWE-bench Pro. The benchmark falls in the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Within that category, SWE-bench Pro contributes 10% of the category score, so a strong result here feeds into a model's overall ranking in proportion to that share.
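The two weights can be combined into a single effective weight. The sketch below assumes the overall score is a plain weighted sum in which the category weight and the within-category share multiply. That is an assumption made for illustration; the BenchLM methodology page is the authority on the actual formula.

```python
CATEGORY_WEIGHT = 0.20   # Coding category's share of the overall score
BENCHMARK_SHARE = 0.10   # SWE-bench Pro's share of the Coding category score

# Assumed: weights multiply, as in a simple nested weighted average.
effective_weight = CATEGORY_WEIGHT * BENCHMARK_SHARE

# Effect of a hypothetical 10-point gain on SWE-bench Pro alone.
benchmark_gain_points = 10.0
overall_gain_points = benchmark_gain_points * effective_weight

print(f"effective weight: {effective_weight:.1%}")
print(f"overall gain from a 10-point benchmark gain: {overall_gain_points:.2f} points")
```

Under that assumption SWE-bench Pro accounts for about 2% of the overall score, so a 10-point gain on this benchmark alone would move the overall score by about 0.2 points.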

About SWE-bench Pro

Year: 2025
Tasks: 1,865 repository problems
Format: Repository task completion
Difficulty: Long-horizon professional engineering

The authors assembled 1,865 problems from 41 repositories across public, held-out, and commercial splits. Agents receive a repository and issue, then produce a patch that must pass the evaluation tests without breaking existing behavior.
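The pass criterion described above can be sketched as a check over two groups of tests. The function and test names here are hypothetical and the sketch is not taken from the official evaluation code: it only illustrates the idea that a patch counts as resolved when the tests for the issue pass and no previously passing test breaks.

```python
def is_resolved(results: dict[str, bool],
                target_tests: set[str],
                regression_tests: set[str]) -> bool:
    """Return True only if every required test passed.

    results maps a test name to whether it passed after applying the patch.
    target_tests are the evaluation tests tied to the issue.
    regression_tests cover existing behavior that must keep working.
    A test missing from results is treated as a failure.
    """
    required = target_tests | regression_tests
    return all(results.get(name, False) for name in required)


# Hypothetical outcome: the fix works but an existing test now fails.
outcome = {"test_issue_fixed": True, "test_existing_login": False}
resolved = is_resolved(outcome, {"test_issue_fixed"}, {"test_existing_login"})
print(resolved)
```

In this example the patch is not counted as resolved, because fixing the issue while breaking existing behavior does not satisfy the criterion.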

Paper: SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

BenchLM freshness & provenance

Version: SWE-bench Pro 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does SWE-bench Pro measure?

SWE-bench Pro gives an agent a repository and issue description, then checks whether its patch passes new tests without breaking existing behavior. The full benchmark contains 1,865 problems from 41 repositories across public, held-out, and commercial splits, with work that can span multiple files and long execution horizons.

Are SWE-bench Pro scores directly comparable?

Only when the split and evaluation setup match. Public, held-out, and commercial tasks are different pools, while scaffold, tool budget, retry policy, and token budget can change pass rates. This page preserves exact published rows, but it does not pretend every provider ran the same harness.

Should SWE-bench Pro decide which coding agent to use?

No. The benchmark covers realistic repository work, but OpenAI's July 2026 audit estimated that about 30% of the public tasks are broken and retracted its earlier adoption recommendation. Use SWE-bench Pro alongside LiveCodeBench, other repository evaluations, and a workload-specific trial instead of treating one score as a procurement decision.

Last updated: July 23, 2026 · BenchLM version SWE-bench Pro 2025
