Benchmark profile

Vals CaseLaw v2 (CaseLaw v2)

Vals AI private question-answer benchmark over Canadian court cases.

Data verified May 4, 2026

How BenchLM shows CaseLaw v2

BenchLM mirrors the public Vals AI CaseLaw v2 leaderboard captured from https://www.vals.ai/benchmarks/case_law_v2 and updated by Vals on May 4, 2026. The snapshot preserves overall scores, uncertainty, latency, cost-per-test metadata, and task-level scores where Vals publishes them.
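As a rough illustration of what one mirrored row might hold, here is a minimal sketch of a record type. The class name, every field name, and the units (seconds, US dollars) are assumptions made for this example; they are not BenchLM's or Vals AI's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SnapshotRow:
    """One mirrored leaderboard row. All names and units here are hypothetical."""
    model: str
    provider: str
    overall_score: float                       # percent, e.g. 79.31
    uncertainty: Optional[float] = None        # None when the source publishes no value
    latency_seconds: Optional[float] = None    # assumed unit
    cost_per_test_usd: Optional[float] = None  # assumed unit
    task_scores: dict[str, float] = field(default_factory=dict)


# Only the fields stated on this page are filled in; the rest stay empty.
row = SnapshotRow(model="Grok 4.3", provider="SpaceXAI", overall_score=79.31)
print(row)
```

Optional fields default to None so that a value missing from the source stays distinguishable from a real zero.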

CaseLaw v2 is display only on BenchLM. Vals proprietary or Vals-hosted aggregate views are useful context, but BenchLM does not use them as weighted ranking inputs or as a replacement for benchmark-native source records.

54 Vals rows · 1 task view · Private dataset · Tasks: Overall · Display only


CaseLaw v2 scores as of May 4, 2026

BenchLM mirrors the published CaseLaw v2 score view. Grok 4.3 leads the public snapshot at 79.31%, followed by GPT-5.1 (73.42%) and GPT-4.1 (69.88%). BenchLM does not use these results to rank models overall.

Grok 4.3 (SpaceXAI, grok/grok-4.3): 79.31%
GPT-5.1 (OpenAI, openai/gpt-5.1-2025-11-13): 73.42%
GPT-4.1 (OpenAI, openai/gpt-4.1-2025-04-14): 69.88%

54 models · External benchmark mirrors · Current · Display only · Updated May 4, 2026

CaseLaw v2 score table (54 models)

Model | Provider | Score
Grok 4.3 | SpaceXAI | 79.31%
GPT-5.1 | OpenAI | 73.42%
GPT-4.1 | OpenAI | 69.88%
GPT-5 Mini | OpenAI | 68.49%
Claude Opus 4.7 | Anthropic | 68.38%
GPT-5 | OpenAI | 66.45%
GPT-5.5 | OpenAI | 66.24%
GPT-5.2 | OpenAI | 66.02%
Grok 4 0709 | SpaceXAI | 65.81%
Grok 4 Fast Reasoning | SpaceXAI | 65.70%
Kimi K2 Thinking | Moonshot AI | 65.70%
Gemini 3.1 Pro Preview | Google | 64.84%
Command A 03 2025 | Cohere | 64.52%
Claude Sonnet 4.6 | Anthropic | 63.99%
Gemini 2.5 Pro | Google | 63.88%
GPT-5.4 | OpenAI | 63.77%
Muse Spark | Meta | 63.13%
Claude Opus 4.5 20251101 Thinking | Anthropic | 62.59%
Claude Sonnet 4.5 20250929 Thinking | Anthropic | 62.16%
Claude Opus 4.6 Thinking | Anthropic | 62.06%
Mistral Large 2512 | Mistral AI | 61.41%
Kimi K2.6 | Moonshot AI | 61.20%
MiniMax M2.7 | MiniMax | 60.88%
Grok 4.1 Fast Reasoning | SpaceXAI | 60.45%
GPT-4o | OpenAI | 59.70%
Qwen3.5 Plus Thinking | Alibaba | 59.70%
DeepSeek V4 Pro | DeepSeek | 59.38%
Kimi K2.5 Thinking | Moonshot AI | 58.73%
Trinity Large Thinking | Arcee-Ai | 57.88%
Claude Haiku 4.5 20251001 Thinking | Anthropic | 56.48%
Qwen3.5 Flash | Alibaba | 55.95%
MiniMax M2.1 | MiniMax | 55.84%
Gemini 3 Flash Preview | Google | 55.84%
DeepSeek V3p2 Thinking | Fireworks AI | 55.41%
Gemini 3.1 Flash Lite Preview | Google | 54.98%
Qwen3 Max | Alibaba | 54.98%
GLM 4.7 | Zhipu AI | 54.88%
Grok 4.20 0309 Reasoning | SpaceXAI | 54.45%
DeepSeek V3p1 | Fireworks AI | 53.91%
MiniMax M2.5 | MiniMax | 53.48%
Qwen3.6 27b | Alibaba | 53.16%
Gemini 3 Pro Preview | Google | 53.05%
Gemma 4 31b It | Google | 52.63%
GPT-5 Nano | OpenAI | 52.63%
GLM 5 Thinking | Zhipu AI | 52.52%
GPT-5.4 Nano | OpenAI | 51.88%
GPT-5.4 Mini | OpenAI | 51.66%
GLM 5.1 | Zhipu AI | 51.55%
Qwen3.6 Plus | Alibaba | 51.45%
GPT Oss 120b | Fireworks AI | 48.77%
Qwen3.6 Max Preview | Alibaba | 47.91%
Qwen3 Max | Alibaba | 47.48%
Mistral Medium 3.5 | Mistral AI | 44.16%
GPT Oss 20b | Fireworks AI | 43.84%

The published CaseLaw v2 snapshot places Grok 4.3 first at 79.31%, 5.89 points ahead of GPT-5.1 and 9.43 points ahead of third-place GPT-4.1. The top 10 rows span 13.61 points (79.31% down to 65.70%), so the table still separates the published systems.
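The gap figures can be checked directly from the published scores. The sketch below copies the first ten rows of the score table and recomputes the differences; the helper name `gap` is made up for this example.

```python
# First ten rows of the CaseLaw v2 score table, in published order (percent).
top10 = [
    ("Grok 4.3", 79.31),
    ("GPT-5.1", 73.42),
    ("GPT-4.1", 69.88),
    ("GPT-5 Mini", 68.49),
    ("Claude Opus 4.7", 68.38),
    ("GPT-5", 66.45),
    ("GPT-5.5", 66.24),
    ("GPT-5.2", 66.02),
    ("Grok 4 0709", 65.81),
    ("Grok 4 Fast Reasoning", 65.70),
]


def gap(rows, i, j):
    """Score difference in percentage points between rows i and j (0-indexed)."""
    return round(rows[i][1] - rows[j][1], 2)


lead_over_third = gap(top10, 0, 2)  # 79.31 - 69.88
top10_range = gap(top10, 0, 9)      # 79.31 - 65.70
print(lead_over_third, top10_range)  # prints: 9.43 13.61
```

Rounding to two decimals matches the precision of the published scores and hides floating-point noise in the subtraction.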

54 models have been evaluated on CaseLaw v2. The benchmark falls in the External benchmark mirrors category, which BenchLM keeps separate from its weighted global scoring system. The results are shown as source-specific evidence and are excluded from the scoring formula, so they do not directly affect overall rankings.
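One way to picture the display-only rule is a weighted mean that skips any benchmark without a weight. This is a sketch under stated assumptions, not BenchLM's scoring formula: the function name, the benchmark keys `benchmark_a` and `benchmark_b`, and the weights are all invented for illustration.

```python
def weighted_score(results, weights):
    """Weighted mean over benchmarks that carry a positive weight.

    Benchmarks missing from `weights` are treated as display only and ignored.
    Hypothetical helper; not BenchLM's actual formula.
    """
    used = {name: score for name, score in results.items() if weights.get(name, 0) > 0}
    total = sum(weights[name] for name in used)
    if total == 0:
        return None  # nothing weighted, so there is no overall score
    return sum(weights[name] * score for name, score in used.items()) / total


# Hypothetical per-benchmark results for one model (percent).
results = {"benchmark_a": 80.0, "benchmark_b": 60.0, "caselaw_v2": 79.31}
# caselaw_v2 is deliberately absent: display only, so it gets no weight.
weights = {"benchmark_a": 2.0, "benchmark_b": 1.0}

score = weighted_score(results, weights)
print(round(score, 2))
```

Under this sketch, changing the `caselaw_v2` value leaves the result untouched, which is the sense in which a display-only benchmark does not affect overall rankings.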

About CaseLaw v2

Year: 2026
Tasks: Canadian case-law question answering
Format: Accuracy score
Difficulty: Professional legal retrieval and reasoning

Vals marks CaseLaw v2 as archived. BenchLM mirrors the public leaderboard as display-only historical legal-domain context.


BenchLM freshness & provenance

Version: CaseLaw v2 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does CaseLaw v2 measure?

CaseLaw v2 is a private Vals AI question-answer benchmark over Canadian court cases.

Which model leads the published CaseLaw v2 snapshot?

Grok 4.3 currently leads the published CaseLaw v2 snapshot with a score of 79.31%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on CaseLaw v2?

54 AI models are included in BenchLM's mirrored CaseLaw v2 snapshot, based on the public leaderboard captured on May 4, 2026.

Last updated: May 4, 2026 · mirrored from the public benchmark leaderboard
