A long-context benchmark for memory, retrieval, and multi-round coherence over large contexts.
As of June 2, 2026, Qwen3.7 Max leads the MRCRv2 leaderboard with 90.4%, followed by Gemini 3.5 Flash (77.3%).
Qwen3.7 Max
Alibaba
Gemini 3.5 Flash
Year: 2025
Tasks: Long-context retrieval
Format: Multi-round long-context evaluation
Difficulty: Hard long-context
MRCRv2 is especially useful for models that compete on long context, since it checks whether they can retrieve the right information across long, multi-round interactions.
Version: MRCRv2 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Qwen3.7 Max by Alibaba currently leads with a score of 90.4% on MRCRv2.
Two AI models have been evaluated on MRCRv2 on BenchLM.