According to BenchLM.ai, Sakana Fugu-Ultra leads the MRCRv2 benchmark with a score of 93.6%, followed by Qwen3.7 Plus (91.7%) and Qwen3.7 Max (90.4%). Scores are spread widely across the leaderboard, which makes this benchmark effective at differentiating model capabilities.

Seven models have been evaluated on MRCRv2, which falls in the Reasoning category. That category carries a 17% weight in BenchLM.ai's overall scoring system, and MRCRv2 contributes 31% of the category score, so strong performance here directly affects a model's overall ranking.
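To make the weighting concrete, here is a minimal sketch of how the two percentages might combine. It assumes the overall score is a plain nested weighted average, so the category weight and the within-category weight multiply; BenchLM.ai's actual formula is not stated here, and the function and variable names are illustrative only.

```python
# Assumption: the overall score is a nested weighted average, so the
# category weight and the benchmark's within-category weight multiply.
CATEGORY_WEIGHT = 0.17   # Reasoning category's share of the overall score
BENCHMARK_WEIGHT = 0.31  # MRCRv2's share of the Reasoning category score


def effective_weight(category_weight: float, benchmark_weight: float) -> float:
    """Share of the overall score attributable to one benchmark."""
    return category_weight * benchmark_weight


weight = effective_weight(CATEGORY_WEIGHT, BENCHMARK_WEIGHT)
print(f"Effective weight of MRCRv2: {weight:.1%}")  # 5.3%

# Gap between the top two models on MRCRv2, in percentage points.
gap = 93.6 - 91.7
contribution = gap * weight
print(f"Overall-score difference from that gap: {contribution:.2f} points")  # 0.10
```

Under that assumption, MRCRv2 accounts for roughly 5% of the overall score, and the 1.9-point lead of the top model over the runner-up translates into about a tenth of a point overall.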

About MRCRv2

Year: 2025
Tasks: Long-context retrieval
Format: Multi-round long-context evaluation
Difficulty: Hard long-context

MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions.

Introducing GPT-5.2 and GPT-5.2 Pro

BenchLM freshness & provenance

Version: MRCRv2 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does MRCRv2 measure?

MRCRv2 is a long-context benchmark that tests memory, retrieval, and multi-round coherence over large contexts.

Which model scores highest on MRCRv2?

Sakana Fugu-Ultra by Sakana AI currently leads with a score of 93.6% on MRCRv2.

How many models are evaluated on MRCRv2?

Seven AI models have been evaluated on MRCRv2 on BenchLM.

Compare Top Models on MRCRv2

Sakana Fugu-Ultra vs Qwen3.7 Plus
Qwen3.7 Plus vs Qwen3.7 Max
Qwen3.7 Max vs Sakana Fugu
Sakana Fugu vs Gemini 3.5 Flash

Last updated: July 23, 2026 · BenchLM version MRCRv2 2025
