Benchmark profile

PinchBench

An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.

The public PinchBench snapshot ranks anthropic/claude-opus-4.8-fast first at 93.5%, ahead of Qwen3.7 Max (92.5%) and Claude Opus 4.8 (90.5%) among 58 tested models. We mirror the table as display-only evidence; it does not affect overall rankings.

How BenchLM shows PinchBench

BenchLM mirrors the public PinchBench average-success-rate view using the official snapshot updated on 07/10/2026, 12:22 AM: 58 models and 614 runs. PinchBench grades runs with automated checks plus an LLM judge.

This benchmark is display only on BenchLM. It is excluded from BenchLM overall rankings, category rankings, and weighted scoring. The table below uses average scores only, matching the public PinchBench average view rather than the best-run view.
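To illustrate the difference between an average view and a best-run view, here is a minimal sketch. The model names and per-run numbers are invented for illustration and are not PinchBench data; the function names are likewise hypothetical.

```python
from statistics import mean

# Hypothetical per-run success rates (percent) for two made-up models.
# These are NOT PinchBench results; they only illustrate the two views.
runs_by_model = {
    "model-a": [90.0, 80.0, 85.0],
    "model-b": [95.0, 60.0, 70.0],
}

def average_view(runs):
    """Mean success rate across all of a model's runs (the view mirrored here)."""
    return {model: round(mean(scores), 1) for model, scores in runs.items()}

def best_run_view(runs):
    """Highest single-run success rate per model."""
    return {model: max(scores) for model, scores in runs.items()}

avg = average_view(runs_by_model)
best = best_run_view(runs_by_model)

# The leader can differ between the two views.
avg_leader = max(avg, key=avg.get)
best_leader = max(best, key=best.get)
print("Average view leader:", avg_leader)
print("Best-run view leader:", best_leader)
```

In this toy data the consistent model leads the average view while the model with one standout run leads the best-run view, which is why the choice of view matters when reading a leaderboard.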

58 models · 614 runs · Average scores only · Official runs · Display only

PinchBench leaderboard · PinchBench methodology · Task list

Average success rate on PinchBench — 07/10/2026, 12:22 AM

BenchLM mirrors the published average success rate view for PinchBench. anthropic/claude-opus-4.8-fast leads the public snapshot at 93.5%, followed by Qwen3.7 Max (92.5%) and Claude Opus 4.8 (90.5%). BenchLM does not use these results to rank models overall.

anthropic/claude-opus-4.8-fast · Anthropic · 93.5% · Overall: not listed

2 · Qwen3.7 Max (qwen/qwen3.7-max) · Alibaba · Closed · 92.5% · Overall 72.84 · Context 1M

3 · Claude Opus 4.8 (anthropic/claude-opus-4.8) · Anthropic · Closed · 90.5% · Overall 78.34 · Context 1M

58 models · Agentic · Current · Display only · Updated 07/10/2026, 12:22 AM

Average success rate table (58 models)

| Model | Provider | Access | Score |
| --- | --- | --- | --- |
| anthropic/claude-opus-4.8-fast | Anthropic | not listed | 93.5% |
| Qwen3.7 Max | Alibaba | Closed | 92.5% |
| Claude Opus 4.8 | Anthropic | Closed | 90.5% |
| nvidia/nemotron-3-ultra-550b-a55b | nvidia | not listed | 89.9% |
| MiMo-V2.5 | Xiaomi | Closed | 89.7% |
| Grok Build 0.1 | xAI | Closed | 88.9% |
| GPT-5.6 Luna | OpenAI | Closed | 88.7% |
| qwen/qwen3.6-flash | Alibaba | not listed | 88.1% |
| MiMo-V2.5-Pro | Xiaomi | Closed | 87.5% |
| GLM-5.2 | Z.AI | Open weight | 87.0% |
| GPT-5.6 Sol | OpenAI | Closed | 84.2% |
| inclusionai/ling-2.6-1t | inclusionai | not listed | 82.6% |
| DeepSeek V4 Flash | DeepSeek | Open weight | 81.7% |
| Gemini 3.1 Pro | Google | Closed | 81.0% |
| Gemini 3.1 Flash-Lite | Google | Closed | 80.5% |
| Grok 4.20 | xAI | Closed | 80.3% |
| Step 3.5 Flash | StepFun | Open weight | 79.4% |
| GPT-5.4 mini | OpenAI | Closed | 79.2% |
| Kimi K2.7 Code | Moonshot AI | Open weight | 76.1% |
| Claude Opus 4.7 | Anthropic | Closed | 76.0% |
| GPT-5.6 Terra | OpenAI | Closed | 75.9% |
| GPT-5.4 | OpenAI | Closed | 75.7% |
| GPT-5.5 | OpenAI | Closed | 75.5% |
| Grok 4.5 | xAI | Closed | 75.2% |
| Seed-2.0-Lite | ByteDance | Closed | 75.0% |
| Gemini 3.5 Flash | Google | Closed | 74.2% |
| Grok 4.3 | xAI | Closed | 73.7% |
| Qwen3.6 Plus | Alibaba | Closed | 72.5% |
| Gemini 3 Flash | Google | Closed | 72.1% |
| GLM-5-Turbo | Z.AI | Closed | 71.8% |
| Claude Opus 4.6 | Anthropic | Closed | 69.9% |
| sakana/fugu-ultra | sakana | not listed | 69.5% |
| mistralai/devstral-2512 | Mistral | not listed | 69.4% |
| GPT-5.4 nano | OpenAI | Closed | 69.0% |
| Claude Haiku 4.5 | Anthropic | Closed | 67.7% |
| GLM-5V-Turbo | Z.AI | Closed | 67.6% |
| MiniMax M2.7 | MiniMax | Open weight | 66.8% |
| Trinity-Large-Preview | Arcee AI | Open weight | 65.7% |
| mistralai/mistral-small-2603 | Mistral | not listed | 64.6% |
| Claude Sonnet 4.6 | Anthropic | Closed | 62.7% |
| aion-labs/aion-3.0 | aion-labs | not listed | 61.1% |
| DeepSeek V4 Pro | DeepSeek | Open weight | 61.1% |
| GLM-5.1 | Z.AI | Open weight | 59.9% |
| google/gemma-4-26b-a4b-it | Google | not listed | 56.4% |
| Claude Fable 5 | Anthropic | Closed | 54.8% |
| Kimi K2.5 | Moonshot AI | Open weight | 54.6% |
| mistralai/mistral-large-2512 | Mistral | not listed | 54.5% |
| Trinity-Large-Preview | Arcee AI | Open weight | 53.0% |
| google/gemma-4-31b-it | Google | not listed | 52.7% |
| anthropic/claude-sonnet-4 | Anthropic | not listed | 48.8% |
| GPT-OSS 120B | OpenAI | Open weight | 44.8% |
| Nemotron 3 Super 120B A12B | NVIDIA | Open weight | 42.2% |
| Mercury 2 | Inception | Closed | 39.6% |
| amazon/nova-2-lite-v1 | amazon | not listed | 37.6% |
| GPT-OSS 20B | OpenAI | Open weight | 36.3% |
| GPT-5.5 Pro | OpenAI | Closed | 21.4% |
| Llama 3.1 70B Instruct | Meta | not listed | 10.7% |
| Llama 4 Scout | Meta | Open weight | 3.2% |

The published PinchBench snapshot places anthropic/claude-opus-4.8-fast first at 93.5%. Third-place Claude Opus 4.8 (90.5%) is 3.0 points behind. The top 10 span 6.5 points, from 93.5% down to GLM-5.2 at 87.0%, so the leading results sit in a relatively narrow band.
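The gap figures quoted here can be reproduced from the top ten rows of the table. The following is a small sketch of that arithmetic; the `gap` helper is a name introduced for illustration only.

```python
# Top-10 average success rates (percent), copied from the leaderboard table.
top10 = [93.5, 92.5, 90.5, 89.9, 89.7, 88.9, 88.7, 88.1, 87.5, 87.0]

def gap(leader: float, other: float) -> float:
    """Point difference between two scores, rounded to one decimal place."""
    return round(leader - other, 1)

# Difference between first place and third place.
third_place_gap = gap(top10[0], top10[2])

# Spread between the highest and lowest score in the top 10.
top10_range = gap(max(top10), min(top10))

print(f"Gap to third place: {third_place_gap} points")
print(f"Top-10 range: {top10_range} points")
```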

PinchBench has 58 evaluated models and falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. PinchBench itself is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
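BenchLM's actual scoring formula is not given on this page, so the following is only an illustrative sketch of how a display-only benchmark might be left out of a weighted category score. The 22% Agentic weight comes from the text; the simple averaging rule, the `BenchmarkResult` type, and the two non-PinchBench entries are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

AGENTIC_WEIGHT = 0.22  # category weight stated in the text

@dataclass
class BenchmarkResult:
    name: str
    score: float                # percent
    display_only: bool = False  # display-only results never enter the score

def category_score(results: List[BenchmarkResult]) -> Optional[float]:
    """Assumed rule: plain mean of the non-display-only scores in a category."""
    scored = [r.score for r in results if not r.display_only]
    return sum(scored) / len(scored) if scored else None

# PinchBench is flagged display-only; the other two entries are hypothetical.
agentic_results = [
    BenchmarkResult("PinchBench", 93.5, display_only=True),
    BenchmarkResult("hypothetical-agentic-a", 70.0),
    BenchmarkResult("hypothetical-agentic-b", 80.0),
]

agentic = category_score(agentic_results)
contribution = AGENTIC_WEIGHT * agentic  # Agentic share of an overall score

print(f"Agentic category score: {agentic}")
print(f"Weighted contribution: {contribution:.2f}")
```

Under this assumed rule, changing the PinchBench value leaves the category score untouched, which matches the display-only behavior described above.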

About PinchBench

Year: 2026

Tasks: 23 OpenClaw agent tasks

Format: Average success rate from official runs

Difficulty: Long-horizon agent workflows

PinchBench publishes official OpenClaw runs across 23 tasks and grades results with automated checks plus an LLM judge. BenchLM mirrors the public average-score view as a display-only benchmark.

About PinchBench · Public benchmark source

BenchLM freshness & provenance

Version: PinchBench 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does PinchBench measure?

An OpenClaw agent benchmark from Kilo that measures successful task completion across standardized real-world agent workflows.

Which model leads the published PinchBench snapshot?

anthropic/claude-opus-4.8-fast currently leads the published PinchBench snapshot with a 93.5% average success rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on PinchBench?

58 AI models are included in BenchLM's mirrored PinchBench snapshot, based on the public leaderboard captured on 07/10/2026, 12:22 AM.

Last updated: 07/10/2026, 12:22 AM · mirrored from the public benchmark leaderboard
