Benchmark profile

DeepSearchQA

An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools.

Data verified July 23, 2026

Benchmark score on DeepSearchQA — July 23, 2026

BenchLM mirrors the published score view for DeepSearchQA. Kimi K3 leads the public snapshot at 95.0%, followed by Claude Opus 4.8 (93.1%) and Step 3.7 Flash (92.8%). BenchLM does not use these results to rank models overall.

1. Kimi K3 · Moonshot AI · Closed · kimi-3 · 95.0% · Overall 80.96 · Context 1.05M

2. Claude Opus 4.8 · Anthropic · Closed · claude-opus-4-8 · 93.1% · Overall 78.34 · Context 1M

3. Step 3.7 Flash · StepFun · Open · step-3-7-flash · 92.8% · Overall 50.87 · Context 256K

11 models · Agentic · Current · Display only · Updated July 23, 2026

Benchmark score table (11 models)

Model · Provider · License · Score

Kimi K3 · Moonshot AI · Closed · 95.0%
Claude Opus 4.8 · Anthropic · Closed · 93.1%
Step 3.7 Flash · StepFun · Open weight · 92.8%
Kimi K2.6 · Moonshot AI · Open weight · 92.5%
Muse Spark 1.1 · Meta · Closed · 84.9%
Kimi K2.5 · Moonshot AI · Open weight · 77.1%
Muse Spark · Meta · Closed · 74.8%
Claude Opus 4.6 · Anthropic · Closed · 73.7%
GPT-5.4 · OpenAI · Closed · 73.6%
Gemini 3.1 Pro · Google · Closed · 69.7%
Grok 4.20 · xAI · Closed · 62.8%

The published DeepSearchQA snapshot places Kimi K3 first at 95.0%. Third-place Step 3.7 Flash is 2.2 points behind at 92.8%. The top-10 range, from Kimi K3 down to Gemini 3.1 Pro at 69.7%, is 25.3 points, so the table still separates the published systems.
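The gap and range figures quoted above can be reproduced from the score table. The following is a minimal sketch: the scores are copied from the table, while the variable names are illustrative and not part of any BenchLM tooling.

```python
# Scores (percent) copied from the benchmark score table above.
scores = {
    "Kimi K3": 95.0,
    "Claude Opus 4.8": 93.1,
    "Step 3.7 Flash": 92.8,
    "Kimi K2.6": 92.5,
    "Muse Spark 1.1": 84.9,
    "Kimi K2.5": 77.1,
    "Muse Spark": 74.8,
    "Claude Opus 4.6": 73.7,
    "GPT-5.4": 73.6,
    "Gemini 3.1 Pro": 69.7,
    "Grok 4.20": 62.8,
}

# Sort scores from highest to lowest.
ranked = sorted(scores.values(), reverse=True)

# Gap between first place and third place, rounded to one decimal.
gap_to_third = round(ranked[0] - ranked[2], 1)

# Spread between first place and tenth place (the "top-10 range").
top10_range = round(ranked[0] - ranked[9], 1)

print(f"Gap to third place: {gap_to_third} points")
print(f"Top-10 range: {top10_range} points")
```

Rounding to one decimal keeps the output aligned with the precision of the published scores and avoids floating-point noise in the differences.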

11 models have been evaluated on DeepSearchQA. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. DeepSearchQA is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About DeepSearchQA

Year: 2026

Tasks: Agentic browsing and list-answer questions

Format: Search / open / find browser-agent evaluation

Difficulty: Agentic web research

Meta describes DeepSearchQA as a browser-tool evaluation graded with an F1-style semantic set match. BenchLM stores it as a display-only agentic search benchmark.

Muse Spark Eval Methodology
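To make the idea of an F1-style set match concrete, here is a minimal sketch of set-level F1 for a list answer. It is not the official grader: the source describes the match as semantic, whereas this sketch substitutes a simple case- and whitespace-insensitive string comparison. The function names `normalize` and `set_f1` and the example answers are hypothetical.

```python
def normalize(item: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for semantic matching)."""
    return " ".join(item.lower().split())


def set_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 between the set of predicted items and the set of gold items."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}

    # If either set is empty, score 1.0 only when both are empty.
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0

    hits = len(pred & ref)
    if hits == 0:
        return 0.0

    precision = hits / len(pred)  # share of predicted items that are correct
    recall = hits / len(ref)      # share of gold items that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical list-style question with four gold items.
gold_answer = ["Paris", "Berlin", "Madrid", "Lisbon"]
model_answer = ["Paris", "berlin ", "Rome"]

score = set_f1(model_answer, gold_answer)
print(f"Set F1: {score:.3f}")
```

In this example two of three predicted items are correct (precision 2/3) and two of four gold items are found (recall 1/2), so the score penalizes both the wrong item and the missing ones. A real semantic matcher would also need to treat paraphrases and aliases as equal, which plain string normalization does not do.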

BenchLM freshness & provenance

Version: DeepSearchQA 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does DeepSearchQA measure?

An agentic browsing benchmark where models search the web, gather evidence, and answer list-style questions using browser tools.

Which model scores highest on DeepSearchQA?

Kimi K3 by Moonshot AI currently leads with a score of 95.0% on DeepSearchQA.

How many models are evaluated on DeepSearchQA?

BenchLM lists DeepSearchQA scores for 11 AI models.


Last updated: July 23, 2026 · BenchLM version DeepSearchQA 2026
