Benchmark profile

MCP Atlas

A benchmark for tool-calling over Model Context Protocol integrations and external tools.

Data verified July 23, 2026

Benchmark score on MCP Atlas — July 23, 2026

BenchLM mirrors the published score view for MCP Atlas. Muse Spark 1.1 leads the public snapshot at 88.1%, followed by Kimi K3 (84.2%) and Gemini 3.5 Flash (83.6%). BenchLM does not use these results to rank models overall.


Benchmark score table (29 models)

| Model | Developer | License | Score |
| --- | --- | --- | --- |
| Muse Spark 1.1 | Meta | Closed | 88.1% |
| Kimi K3 | Moonshot AI | Closed | 84.2% |
| Gemini 3.5 Flash | Google | Closed | 83.6% |
| Claude Opus 4.8 | Anthropic | Closed | 82.2% |
| Claude Opus 4.7 (Adaptive) | Anthropic | Closed | 77.3% |
| GLM-5.2 | Z.AI | Open weight | 76.8% |
| Qwen3.7 Max | Alibaba | Closed | 76.4% |
| Kimi K2.7 Code | Moonshot AI | Open weight | 76% |
| GPT-5.5 | OpenAI | Closed | 75.3% |
| DeepSeek V4 Pro (High) | DeepSeek | Open weight | 74.2% |
| MiniMax M3 | MiniMax | Open weight | 74.2% |
| Inkling | Thinking Machines Lab | Open weight | 74.1% |
| DeepSeek V4 Pro (Max) | DeepSeek | Open weight | 73.6% |
| Qwen3.7 Plus | Alibaba | Closed | 73.2% |
| GLM-5.1 | Z.AI | Open weight | 71.8% |
| GPT-5.4 | OpenAI | Closed | 70.6% |
| DeepSeek V4 Pro | DeepSeek | Open weight | 69.4% |
| DeepSeek V4 Flash (Max) | DeepSeek | Open weight | 69% |
| DeepSeek V4 Flash (High) | DeepSeek | Open weight | 67.4% |
| DeepSeek V4 Flash | DeepSeek | Open weight | 64% |
| Qwen3.6-35B-A3B | Alibaba | Open weight | 62.8% |
| GPT-5.4 mini | OpenAI | Closed | 57.7% |
| GPT-5.4 nano | OpenAI | Closed | 56.1% |
| Kimi K2.6 | Moonshot AI | Open weight | 55.9% |
| Qwen3.6 Plus | Alibaba | Closed | 48.2% |
| Qwen3.5 397B | Alibaba | Open weight | 46.1% |
| Claude Opus 4.5 | Anthropic | Closed | 42.3% |
| GLM-5 | Z.AI | Open weight | 31.1% |
| Kimi K2.5 | Moonshot AI | Open weight | 29.5% |

The published MCP Atlas snapshot places Muse Spark 1.1 first at 88.1%. Gemini 3.5 Flash, in third place, trails by 4.5 points. The top 10 span 13.9 points (88.1% down to 74.2%), so the table still separates the published systems.
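The gap figures above are simple differences between published scores. As a minimal sketch for checking them, the snippet below recomputes both values from the top-10 rows of the table. The `top10` list and the `gap` helper are illustrative names, not part of any BenchLM tooling, and the scores are copied by hand from the snapshot.

```python
# Top-10 MCP Atlas scores (percent), copied by hand from the published table.
# The list name and helper below are illustrative, not a BenchLM API.
top10 = [
    ("Muse Spark 1.1", 88.1),
    ("Kimi K3", 84.2),
    ("Gemini 3.5 Flash", 83.6),
    ("Claude Opus 4.8", 82.2),
    ("Claude Opus 4.7 (Adaptive)", 77.3),
    ("GLM-5.2", 76.8),
    ("Qwen3.7 Max", 76.4),
    ("Kimi K2.7 Code", 76.0),
    ("GPT-5.5", 75.3),
    ("DeepSeek V4 Pro (High)", 74.2),
]


def gap(rows, i, j):
    """Return the point gap between rows i and j (0-indexed), to one decimal."""
    return round(rows[i][1] - rows[j][1], 1)


third_place_gap = gap(top10, 0, 2)   # first place vs. third place
top10_range = gap(top10, 0, 9)       # first place vs. tenth place

print(f"Gap to third place: {third_place_gap} points")
print(f"Top-10 range: {top10_range} points")
```

Rounding to one decimal keeps floating-point noise out of the result, since the published scores carry only one decimal place.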

29 models have been evaluated on MCP Atlas. The benchmark falls in the Agentic category, which carries a 22% weight in BenchLM.ai's overall scoring system. MCP Atlas itself is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.

About MCP Atlas

Year: 2026
Tasks: Tool-integrated agent tasks
Format: Interactive tool-calling evaluation
Difficulty: Advanced tool use

OpenAI reports MCP Atlas as a tool-use benchmark that measures how well models work with MCP-backed systems and external tools.

Source: Introducing GPT-5.4 mini and nano (OpenAI)

BenchLM freshness & provenance

Version: MCP Atlas 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
Status: Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does MCP Atlas measure?

MCP Atlas measures tool-calling over Model Context Protocol integrations and external tools.

Which model scores highest on MCP Atlas?

Muse Spark 1.1 by Meta currently leads with a score of 88.1% on MCP Atlas.

How many models are evaluated on MCP Atlas?

29 AI models have been evaluated on MCP Atlas on BenchLM.


Last updated: July 23, 2026 · BenchLM version MCP Atlas 2026
