Benchmark profile

LisanBench

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
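As a rough illustration of the task described above, the following Python sketch checks a submitted chain against the two stated constraints: no repeated words, and each link at edit distance 1 from the previous word. It assumes that "edit distance" means Levenshtein distance (one substitution, insertion, or deletion) and that a chain is judged by its longest valid prefix. The function names, the tiny vocabulary, and the prefix rule are hypothetical and are not taken from the LisanBench code.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """Return True if a and b differ by exactly one substitution, insertion, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    shorter, longer = sorted((a, b), key=len)
    # Lengths differ by one: deleting a single character from the longer
    # word must produce the shorter word.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def valid_prefix_length(chain: list[str], vocabulary: set[str]) -> int:
    """Count the words accepted before the first rule violation (assumed scoring rule)."""
    seen: set[str] = set()
    previous = None
    count = 0
    for word in chain:
        if word in seen or word not in vocabulary:
            break  # repeated word or word outside the vocabulary
        if previous is not None and not is_one_edit_apart(previous, word):
            break  # link is not at edit distance 1
        seen.add(word)
        previous = word
        count += 1
    return count


# Hypothetical toy vocabulary; the real benchmark uses its own word list.
vocabulary = {"cold", "cord", "card", "cart", "cat", "warm"}

print(valid_prefix_length(["cold", "cord", "card", "cart", "cat"], vocabulary))  # 5
print(valid_prefix_length(["cold", "cord", "cold", "card"], vocabulary))         # 2
```

The second call stops at the repeated "cold", which is why the non-repetition rule demands recall as well as vocabulary.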

How BenchLM shows LisanBench

BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com for the July 21, 2026 snapshot. The public benchmark tests 150 model variants across 50 starting words, with 3 trials per starting word.

LisanBench is a strong reasoning reference, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific, strongly dependent on English vocabulary, and not yet aligned cleanly enough with BenchLM's canonical model rows to serve as a ranking input.

150 model variants · 50 starting words · 3 trials per word · Difficulty-weighted scores · Display only

LisanBench leaderboard · LisanBench methodology · Code and data

Difficulty-weighted score on LisanBench — July 21, 2026 snapshot

BenchLM mirrors the published difficulty-weighted score view for LisanBench. Opus 4.7 (xhigh) leads the public snapshot at 5122.60, followed by Claude Fable 5 (medium) (4561.82) and Opus 4.6 (16k) (3526.49). BenchLM does not use these results to rank models overall.

1 · Closed
Opus 4.7 (xhigh)
Anthropic
anthropic/claude-opus-4.7:thinking-xhigh
5122.60
Overall 66.27 · Context 1M

2 · Closed
Claude Fable 5 (medium)
Anthropic
anthropic/claude-fable-5:thinking-medium
4561.82
Overall 83.68 · Context 1M+

3
Opus 4.6 (16k)
Anthropic
anthropic/claude-opus-4.6:thinking-16k
3526.49
Overall —

150 models · Reasoning · Current · Display only · Updated July 21, 2026 snapshot

Difficulty-weighted score table (150 models)

Rank · Model · Provider · License (where listed) · Score

1 · Opus 4.7 (xhigh) · Anthropic · Closed · 5122.60
2 · Claude Fable 5 (medium) · Anthropic · Closed · 4561.82
3 · Opus 4.6 (16k) · Anthropic · 3526.49
4 · GPT 5.5 (medium) · OpenAI · Closed · 3315.52
5 · Sonnet 4.6 (16k) · Anthropic · 2944.27
6 · GPT 5.4 (medium) · OpenAI · Closed · 2738.16
7 · Opus 4.8 (high) · Anthropic · Closed · 2693.67
8 · Opus 4.5 (16k) · Anthropic · 2204.43
9 · Gemini 3.1 Pro Preview (high) · Google · Closed · 1929.11
10 · Grok 4 (medium) · xAI · Closed · 1778.40
11 · Sonnet 5 (high) · Anthropic · Closed · 1736.56
12 · O3 (medium) · OpenAI · Closed · 1523.00
13 · Deepseek V3.2 Speciale (thinking) · DeepSeek · 1510.70
14 · Grok 4.20 Beta (thinking) · xAI · Closed · 1464.63
15 · GPT 5.2 (medium) · OpenAI · Closed · 1458.80
16 · GPT 5 (medium) · OpenAI · Closed · 1457.28
17 · GPT 5.6 Sol (medium) · OpenAI · Closed · 1403.28
18 · GPT 5.6 Terra (medium) · OpenAI · Closed · 1196.90
19 · Gemini 3 Pro Preview (high) · Google · Closed · 1130.63
20 · Gemini 3.5 Flash (high) · Google · Closed · 1128.26
21 · Sonnet 4.5 (16k) · Anthropic · 1090.82
22 · Deepseek V4 Flash (high) · DeepSeek · Open weight · 1063.47
23 · Deepseek V4 Pro (high) · DeepSeek · Open weight · 1059.51
24 · Deepseek V3.2 (thinking) · DeepSeek · Open weight · 925.31
25 · Gemini 3.1 Pro Preview (low) · Google · Closed · 872.67
26 · Step 3.5 Flash (thinking) · StepFun · Open weight · 811.21
27 · Grok 4 Fast (thinking) · xAI · 806.48
28 · GPT 5 Mini (medium) · OpenAI · Closed · 758.84
29 · GPT 5.6 Luna (medium) · OpenAI · Closed · 648.18
30 · Kimi K2.5 (thinking) · Moonshot AI · Closed · 641.96
31 · Kimi K2 (thinking) · Moonshot AI · Closed · 633.05
32 · GPT 5 Nano (medium) · OpenAI · Closed · 626.86
33 · Grok 4.1 Fast (thinking) · xAI · Closed · 604.43
34 · Sonnet 4 (16k) · Anthropic · 602.95
35 · GLM 5.2 (high) · Z.AI · Open weight · 591.99
36 · Gemini 3 Flash Preview (high) · Google · Closed · 591.88
37 · GPT 5.4 Mini (medium) · OpenAI · Closed · 591.48
38 · GPT 5.4 Nano (medium) · OpenAI · Closed · 543.21
39 · O3 Mini (medium) · OpenAI · Closed · 518.37
40 · Doubao Seed 2.0 Pro (thinking) · StepFun · 456.93
41 · GPT-OSS-120B (medium) · OpenAI · Open weight · 448.33
42 · Qwen3.5 397B A17B (thinking) · Alibaba · Open weight · 387.74
43 · O4 Mini (medium) · OpenAI · Closed · 352.68
44 · GLM 5 (thinking) · Z.AI · Open weight · 336.34
45 · GPT 5.6 Sol · OpenAI · 332.46
46 · GPT 5.5 · OpenAI · 305.30
47 · Opus 4.8 · Anthropic · Closed · 270.10
48 · Doubao Seed 2.0 Lite (thinking) · StepFun · 265.12
49 · Opus 4 · Anthropic · 262.56
50 · Doubao Seed 1.8 (thinking) · StepFun · 255.84
51 · Minimax M2.5 (thinking) · MiniMax · Closed · 228.38
52 · Qwen3 235B A22B 2507 (thinking) · Alibaba · Open weight · 226.22
53 · Opus 4.7 · Anthropic · Closed · 217.52
54 · Opus 4.1 · Anthropic · Closed · 215.66
55 · Sonnet 4.6 · Anthropic · Closed · 208.64
56 · Sonnet 5 · Anthropic · 208.26
57 · Gemini 2.5 Pro (16k) · Google · Closed · 197.43
58 · Grok 3 Mini (medium) · xAI · Closed · 194.03
59 · Grok 3 (thinking) · xAI · Closed · 188.12
60 · Sonnet 3.7 · Anthropic · 162.44
61 · GPT-OSS-20B (medium) · OpenAI · Open weight · 156.99
62 · GPT 5.6 Terra · OpenAI · 156.14
63 · Doubao Seed 2.0 Mini (thinking) · StepFun · 150.35
64 · Sonnet 4 · Anthropic · Closed · 150.17
65 · Sonnet 3.6 · Anthropic · Closed · 149.65
66 · Sonnet 3.5 · Anthropic · Closed · 129.73
67 · Gemini Pro 1.5 · Google · Closed · 119.72
68 · Deepseek V3.2 · DeepSeek · Open weight · 119.04
69 · Deepseek V4 Pro · DeepSeek · Open weight · 117.14
70 · Gemini 2.5 Flash (16k) · Google · 112.75
71 · Deepseek R1 0528 (thinking) · DeepSeek · Open weight · 111.60
72 · Qwen3.5 122B A10B (thinking) · Alibaba · Open weight · 109.62
73 · GPT 5.4 · OpenAI · Closed · 109.51
74 · GLM 4.5 (thinking) · Z.AI · Closed · 108.32
75 · Qwen3.5 35B A3B (thinking) · Alibaba · Open weight · 107.61
76 · Olmo 3 32B (thinking) · Allen AI · 104.86
77 · Sonnet 4.5 · Anthropic · Closed · 103.58
78 · Deepseek V3 · DeepSeek · Open weight · 103.39
79 · O1 Mini (medium) · OpenAI · 103.10
80 · GPT 5.6 Luna · OpenAI · 96.63
81 · GPT 4o · OpenAI · Closed · 94.16
82 · Opus 4.5 · Anthropic · Closed · 93.49
83 · Opus 4.6 · Anthropic · Closed · 91.61
84 · GPT 4 Turbo · OpenAI · Closed · 91.48
85 · Kimi K2 · Moonshot AI · Closed · 85.92
86 · Qwen3 4B (16k) · Alibaba · 77.21
87 · Opus 3 · Anthropic · Closed · 75.77
88 · Gemini 2.5 Flash · Google · Closed · 72.17
89 · Minimax M1 (thinking) · MiniMax · Closed · 66.71
90 · GLM 5.2 · Z.AI · 64.50
91 · Gemini 2.0 Flash · Google · 62.04
92 · Deepseek V4 Flash · DeepSeek · Open weight · 56.78
93 · Horizon Beta · OpenRouter · 55.69
94 · Gemini Flash 1.5 · Google · 55.20
95 · GLM 4.5 Air (thinking) · Z.AI · Closed · 54.86
96 · Nova Pro V1 · Amazon · Closed · 54.38
97 · GLM 4.7 (thinking) · Z.AI · Open weight · 54.27
98 · Polaris Alpha · OpenRouter · 53.34
99 · Haiku 4.5 · Anthropic · Closed · 52.64
100 · GPT 3.5 Turbo · OpenAI · 50.96
101 · Qwen3 Coder · Alibaba · 50.95
102 · Llama 3.1 405B · Meta · 48.88
103 · Grok 4.1 Fast · xAI · Closed · 47.29
104 · GLM 4.6 (thinking) · Z.AI · Open weight · 44.20
105 · Gemma 3 27B · Google · 43.79
106 · Mistral Medium 3 · Mistral · Closed · 43.31
107 · GPT 5.4 Mini · OpenAI · 42.74
108 · Llama 4 Maverick · Meta · Open weight · 42.33
109 · GPT 4.1 · OpenAI · Closed · 42.02
110 · Ernie 4.5 300B A47B · Baidu · 40.99
111 · Sherlock Dash Alpha · OpenRouter · 40.07
112 · Devstral Medium · Mistral AI · 40.03
113 · Gemini 2.0 Flash Lite 001 · Google · 38.77
114 · Haiku 3.5 · Anthropic · 38.22
115 · Gemini 2.5 Flash Lite · Google · 38.21
116 · Llama 3.1 70B · Meta · 38.06
117 · Qwen3 1.7B (16k) · Alibaba · 38.06
118 · Haiku 3 · Anthropic · Closed · 35.46
119 · Mistral Large 2411 · Mistral AI · 34.64
120 · GPT 4.1 Mini · OpenAI · Closed · 32.82
121 · Gemini 2.5 Flash Lite (16k) · Google · 32.37
122 · Qwen3 235B A22B 2507 · Alibaba · 31.92
123 · Llama 4 Scout · Meta · Open weight · 30.85
124 · Mimo V2 Flash (thinking) · Xiaomi · Open weight · 28.75
125 · Nova Lite V1 · Amazon · 25.24
126 · Nova Micro V1 · Amazon · 24.88
127 · Gemini Flash 1.5 8B · Google · 24.38
128 · Qwen3 32B · Alibaba · 23.86
129 · Gemma 3 12B · Google · 21.56
130 · GPT 4o Mini · OpenAI · Closed · 21.21
131 · Qwen3 30B A3B 2507 · Alibaba · 18.98
132 · Mistral Small 3.2 24B · Mistral AI · 17.77
133 · Qwen3 14B · Alibaba · 16.68
134 · Qwen3 8B · Alibaba · 15.50
135 · GPT 4.1 Nano · OpenAI · Closed · 14.95
136 · Devstral Small · Mistral AI · 13.22
137 · Codestral 2508 · Mistral AI · 13.15
138 · Ministral 14B 2512 · Mistral AI · 12.29
139 · Ministral 8B 2512 · Mistral AI · 11.78
140 · Mistral Nemo · Mistral AI · 11.47
141 · GPT 5.4 Nano · OpenAI · 11.16
142 · Gemma 3 4B · Google · 9.41
143 · Qwen3 0.6B (16k) · Alibaba · 9.24
144 · Qwen3 4B · Alibaba · 7.90
145 · Ministral 3B 2512 · Mistral AI · 6.64
146 · Qwen3 1.7B · Alibaba · 6.27
147 · Llama 3.1 8B · Meta · 3.92
148 · Llama 3.2 3B · Meta · 2.85
149 · Llama 3.2 1B · Meta · 0.63
150 · Qwen3 0.6B · Alibaba · 0.06

The published LisanBench snapshot places Opus 4.7 (xhigh) first at 5122.60. Third-place Opus 4.6 (16k) trails by 1596.11 points, and the top 10 span 3344.20 points (from 5122.60 down to 1778.40), so the table clearly separates the leading systems.
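The gap figures quoted here can be reproduced from the published top-10 scores. The sketch below is only an arithmetic check. The variable names are illustrative, and it uses `Decimal` so that the two-decimal scores subtract exactly.

```python
from decimal import Decimal

# Top-10 difficulty-weighted scores from the July 21, 2026 snapshot, in rank order.
top10_scores = [
    Decimal(s)
    for s in (
        "5122.60", "4561.82", "3526.49", "3315.52", "2944.27",
        "2738.16", "2693.67", "2204.43", "1929.11", "1778.40",
    )
]

# Gap between the first and third rows.
gap_first_to_third = top10_scores[0] - top10_scores[2]

# Spread between the best and worst score within the top 10.
top10_range = max(top10_scores) - min(top10_scores)

print(gap_first_to_third)  # 1596.11
print(top10_range)         # 3344.20
```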

150 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About LisanBench

Year: 2026

Tasks: 50 starting words × 3 trials

Format: Difficulty-weighted word-chain reasoning

Difficulty: Open-ended lexical planning

BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 150 model variants across 50 starting words, with 3 trials per word.

LisanBench methodology · Public benchmark source

BenchLM freshness & provenance

Version: LisanBench 2026

Refresh cadence: Static

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does LisanBench measure?

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.

Which model leads the published LisanBench snapshot?

Opus 4.7 (xhigh) currently leads the published LisanBench snapshot with a difficulty-weighted score of 5122.60. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on LisanBench?

150 AI models are included in BenchLM's mirrored LisanBench snapshot, which is based on the public leaderboard as of July 21, 2026.

Last updated: July 21, 2026 snapshot · mirrored from the public benchmark leaderboard
