
LisanBench

A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
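The chain rule described above can be sketched in a few lines of Python. This is only an illustration, assuming "edit-distance-1" means a Levenshtein distance of exactly 1 (one substitution, insertion, or deletion); the function names and the tiny vocabulary are hypothetical, and this is not the official LisanBench scorer.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """Return True if a and b have Levenshtein distance exactly 1."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make sure a is the shorter (or equal-length) word
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the differing letter must match.
        return a[i + 1:] == b[i + 1:]
    # Insertion/deletion: the rest of a must match b after skipping one letter.
    return a[i:] == b[i + 1:]


def valid_chain_length(chain, vocabulary):
    """Count the leading words that form a valid, non-repeating chain."""
    seen = set()
    previous = None
    count = 0
    for word in chain:
        word = word.lower()
        if word not in vocabulary or word in seen:
            break  # unknown word or repeat ends the chain
        if previous is not None and not is_one_edit_apart(previous, word):
            break  # more (or less) than one edit from the previous word
        seen.add(word)
        previous = word
        count += 1
    return count


# Hypothetical toy vocabulary, for illustration only.
vocabulary = {"cat", "cot", "cog", "dog", "cart"}
print(valid_chain_length(["cat", "cot", "cog", "cat", "dog"], vocabulary))
```

In this toy run the chain stops at the fourth word because "cat" repeats, which is why a model has to plan ahead and remember every word it has already used.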

How BenchLM shows LisanBench

BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com for the June 2, 2026 snapshot. The public benchmark tests 139 model variants across 50 starting words, with 3 trials per starting word.

LisanBench is a strong reasoning reference, but BenchLM currently keeps it display only rather than weighted. The public leaderboard is highly variant-specific, strongly English-vocabulary-dependent, and not yet aligned cleanly enough with BenchLM canonical model rows to use as a ranking input.

139 model variants · 50 starting words · 3 trials per word · Difficulty-weighted scores · Display only

Difficulty-weighted score on LisanBench — June 2, 2026 snapshot

BenchLM mirrors the published difficulty-weighted score view for LisanBench. Claude Opus 4.7 leads the public snapshot at 5122.60, followed by Opus 4.6 (16k) at 3526.49 and GPT-5.5 at 3315.52. BenchLM does not use these results to rank models overall.

139 models · Reasoning · Current · Display only · Updated June 2, 2026 snapshot

The published LisanBench snapshot is not tightly clustered at the top: Claude Opus 4.7 sits at 5122.60, and the third row (GPT-5.5 at 3315.52) is 1807.08 points behind, roughly 35% below the leader. The broader top-10 spread is 3599.60 points, so the benchmark clearly separates even the strongest models.
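The gap and spread figures quoted above follow directly from the top-10 scores in the table further down. A minimal check, with the scores copied from this snapshot and variable names chosen only for illustration:

```python
# Top-10 difficulty-weighted scores, copied from the June 2, 2026 snapshot table.
top10 = [5122.60, 3526.49, 3315.52, 2944.27, 2738.16,
         2693.67, 2204.43, 1929.11, 1778.40, 1523.00]

gap_to_third = round(top10[0] - top10[2], 2)   # leader minus third row
top10_spread = round(top10[0] - top10[-1], 2)  # leader minus tenth row
relative_gap = gap_to_third / top10[0]         # gap as a fraction of the leader's score

print(f"Gap to third: {gap_to_third:.2f}")
print(f"Top-10 spread: {top10_spread:.2f}")
print(f"Third row trails the leader by {relative_gap:.1%}")
```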

139 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About LisanBench

Year: 2026
Tasks: 50 starting words × 3 trials
Format: Difficulty-weighted word-chain reasoning
Difficulty: Open-ended lexical planning

BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 139 model variants across 50 starting words with 3 trials per word.

BenchLM freshness & provenance

Version: LisanBench 2026
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Difficulty-weighted score table (139 models)

Rank · Model · Variant ID · Score
1 · Claude Opus 4.7 · anthropic/claude-opus-4.7:thinking-xhigh · 5122.60
2 · Opus 4.6 (16k) · anthropic/claude-opus-4.6:thinking-16k · 3526.49
3 · GPT-5.5 · openai/gpt-5.5:thinking-medium · 3315.52
4 · Sonnet 4.6 (16k) · anthropic/claude-sonnet-4.6:thinking-16k · 2944.27
5 · GPT 5.4 (medium) · openai/gpt-5.4:thinking-medium · 2738.16
6 · Claude Opus 4.8 · anthropic/claude-opus-4.8:thinking-high · 2693.67
7 · Opus 4.5 (16k) · anthropic/claude-opus-4.5:thinking-16k · 2204.43
8 · Gemini 3.1 Pro Preview (high) · google/gemini-3.1-pro-preview:thinking-high · 1929.11
9 · Grok 4 (medium) · x-ai/grok-4:thinking-medium · 1778.40
10 · O3 (medium) · openai/o3:thinking-medium · 1523.00
11 · Deepseek V3.2 Speciale (thinking) · deepseek/deepseek-v3.2-speciale:thinking · 1510.70
12 · Grok 4.20 Beta (thinking) · x-ai/grok-4.20-beta:thinking · 1464.63
13 · GPT 5.2 (medium) · openai/gpt-5.2:thinking-medium · 1458.80
14 · GPT 5 (medium) · openai/gpt-5 · 1457.28
15 · Gemini 3 Pro Preview (high) · google/gemini-3-pro-preview · 1130.63
16 · Gemini 3.5 Flash · google/gemini-3.5-flash:thinking-high · 1128.26
17 · Sonnet 4.5 (16k) · anthropic/claude-sonnet-4.5:thinking-16k · 1090.82
18 · DeepSeek V4 Flash (High) · zenmux/deepseek-v4-flash:thinking-high · 1063.47
19 · DeepSeek V4 Pro (High) · zenmux/deepseek-v4-pro:thinking-high · 1059.51
20 · Deepseek V3.2 (thinking) · deepseek/deepseek-v3.2:thinking · 925.31
21 · Gemini 3.1 Pro Preview (low) · google/gemini-3.1-pro-preview:thinking-low · 872.67
22 · Step 3.5 Flash (thinking) · zenmux/step-3.5-flash:thinking · 811.21
23 · Grok 4 Fast (thinking) · x-ai/grok-4-fast:free · 806.48
24 · GPT 5 Mini (medium) · openai/gpt-5-mini · 758.84
25 · Kimi K2.5 (thinking) · moonshotai/kimi-k2.5:thinking · 641.96
26 · Kimi K2 (thinking) · moonshotai/kimi-k2-thinking · 633.05
27 · GPT 5 Nano (medium) · openai/gpt-5-nano · 626.86
28 · Grok 4.1 Fast (thinking) · x-ai/grok-4.1-fast:thinking · 604.43
29 · Sonnet 4 (16k) · anthropic/claude-sonnet-4:thinking-16k · 602.95
30 · Gemini 3 Flash Preview (high) · google/gemini-3-flash-preview · 591.88
31 · GPT 5.4 Mini (medium) · openai/gpt-5.4-mini:thinking-medium · 591.48
32 · GPT 5.4 Nano (medium) · openai/gpt-5.4-nano:thinking-medium · 543.21
33 · O3 Mini (medium) · openai/o3-mini · 518.37
34 · Doubao Seed 2.0 Pro (thinking) · zenmux/doubao-seed-2.0-pro:thinking · 456.93
35 · GPT-OSS-120B (medium) · openai/gpt-oss-120b · 448.33
36 · Qwen3.5 397B A17B (thinking) · qwen/qwen3.5-397b-a17b:thinking · 387.74
37 · O4 Mini (medium) · openai/o4-mini · 352.68
38 · GLM 5 (thinking) · z-ai/glm-5:thinking · 336.34
39 · GPT 5.5 · openai/gpt-5.5:thinking-none · 305.30
40 · Claude Opus 4.8 · anthropic/claude-opus-4.8 · 270.10
41 · Doubao Seed 2.0 Lite (thinking) · zenmux/doubao-seed-2.0-lite:thinking · 265.12
42 · Opus 4 · anthropic/claude-opus-4 · 262.56
43 · Doubao Seed 1.8 (thinking) · zenmux/doubao-seed-1.8:thinking · 255.84
44 · Minimax M2.5 (thinking) · minimax/minimax-m2.5:thinking · 228.38
45 · Qwen3 235B A22B 2507 (thinking) · qwen/qwen3-235b-a22b-thinking-2507 · 226.22
46 · Claude Opus 4.7 · anthropic/claude-opus-4.7 · 217.52
47 · Opus 4.1 · anthropic/claude-opus-4.1 · 215.66
48 · Sonnet 4.6 · anthropic/claude-sonnet-4.6 · 208.64
49 · Gemini 2.5 Pro (16k) · google/gemini-2.5-pro:thinking-16k · 197.43
50 · Grok 3 Mini (medium) · x-ai/grok-3-mini:thinking-medium · 194.03
51 · (missing) · (missing) · 188.12
52 · Sonnet 3.7 · anthropic/claude-3.7-sonnet · 162.44
53 · GPT-OSS-20B (medium) · openai/gpt-oss-20b · 156.99
54 · Doubao Seed 2.0 Mini (thinking) · zenmux/doubao-seed-2.0-mini:thinking · 150.35
55 · Sonnet 4 · anthropic/claude-sonnet-4 · 150.17
56 · Sonnet 3.6 · anthropic/claude-3.5-sonnet · 149.65
57 · Sonnet 3.5 · anthropic/claude-3.5-sonnet-20240620 · 129.73
58 · Gemini Pro 1.5 · google/gemini-pro-1.5 · 119.72
59 · Deepseek V3.2 · deepseek/deepseek-v3.2 · 119.04
60 · DeepSeek V4 Pro · zenmux/deepseek-v4-pro · 117.14
61 · Gemini 2.5 Flash (16k) · google/gemini-2.5-flash:thinking-16k · 112.75
62 · Deepseek R1 0528 (thinking) · deepseek/deepseek-r1-0528 · 111.60
63 · Qwen3.5 122B A10B (thinking) · qwen/qwen3.5-122b-a10b:thinking · 109.62
64 · GPT 5.4 · openai/gpt-5.4:thinking-none · 109.51
65 · GLM 4.5 (thinking) · z-ai/glm-4.5 · 108.32
66 · Qwen3.5 35B A3B (thinking) · qwen/qwen3.5-35b-a3b:thinking · 107.61
67 · Olmo 3 32B (thinking) · allenai/olmo-3-32b-think · 104.86
68 · Sonnet 4.5 · anthropic/claude-sonnet-4.5 · 103.58
69 · Deepseek V3 · deepseek/deepseek-chat · 103.39
70 · O1 Mini (medium) · openai/o1-mini · 103.10
71 · GPT 4o · openai/chatgpt-4o-latest · 94.16
72 · Opus 4.5 · anthropic/claude-opus-4.5 · 93.49
73 · Opus 4.6 · anthropic/claude-opus-4.6 · 91.61
74 · GPT 4 Turbo · openai/gpt-4-turbo · 91.48
75 · Kimi K2 · moonshotai/kimi-k2 · 85.92
76 · Qwen3 4B (16k) · Qwen/Qwen3-4B-FP8:thinking-16k · 77.21
77 · Opus 3 · anthropic/claude-3-opus · 75.77
78 · Gemini 2.5 Flash · google/gemini-2.5-flash · 72.17
79 · Minimax M1 (thinking) · minimax/minimax-m1 · 66.71
80 · Gemini 2.0 Flash · google/gemini-2.0-flash-001 · 62.04
81 · DeepSeek V4 Flash · zenmux/deepseek-v4-flash · 56.78
82 · Horizon Beta · openrouter/horizon-beta · 55.69
83 · Gemini Flash 1.5 · google/gemini-flash-1.5 · 55.20
84 · GLM 4.5 Air (thinking) · z-ai/glm-4.5-air · 54.86
85 · Nova Pro V1 · amazon/nova-pro-v1 · 54.38
86 · GLM 4.7 (thinking) · z-ai/glm-4.7 · 54.27
87 · Polaris Alpha · openrouter/polaris-alpha · 53.34
88 · Haiku 4.5 · anthropic/claude-haiku-4.5 · 52.64
89 · GPT 3.5 Turbo · openai/gpt-3.5-turbo-0613 · 50.96
90 · Qwen3 Coder · qwen/qwen3-coder · 50.95
91 · Llama 3.1 405B · meta-llama/llama-3.1-405b-instruct · 48.88
92 · Grok 4.1 Fast · x-ai/grok-4.1-fast · 47.29
93 · GLM 4.6 (thinking) · z-ai/glm-4.6 · 44.20
94 · Gemma 3 27B · google/gemma-3-27b-it · 43.79
95 · Mistral Medium 3 · mistralai/mistral-medium-3 · 43.31
96 · GPT 5.4 Mini · openai/gpt-5.4-mini:thinking-none · 42.74
97 · Llama 4 Maverick · meta-llama/llama-4-maverick · 42.33
98 · GPT 4.1 · openai/gpt-4.1 · 42.02
99 · Ernie 4.5 300B A47B · baidu/ernie-4.5-300b-a47b · 40.99
100 · Sherlock Dash Alpha · openrouter/sherlock-dash-alpha · 40.07
101 · Devstral Medium · mistralai/devstral-medium · 40.03
102 · Gemini 2.0 Flash Lite 001 · google/gemini-2.0-flash-lite-001 · 38.77
103 · Haiku 3.5 · anthropic/claude-3-5-haiku-20241022 · 38.22
104 · Gemini 2.5 Flash Lite · google/gemini-2.5-flash-lite · 38.21
105 · Llama 3.1 70B · meta-llama/llama-3.1-70b-instruct · 38.06
106 · Qwen3 1.7B (16k) · Qwen/Qwen3-1.7B-FP8:thinking-16k · 38.06
107 · Haiku 3 · anthropic/claude-3-haiku · 35.46
108 · Mistral Large 2411 · mistralai/mistral-large-2411 · 34.64
109 · GPT 4.1 Mini · openai/gpt-4.1-mini · 32.82
110 · Gemini 2.5 Flash Lite (16k) · google/gemini-2.5-flash-lite:thinking-16k · 32.37
111 · Qwen3 235B A22B 2507 · qwen/qwen3-235b-a22b-2507 · 31.92
112 · Llama 4 Scout · meta-llama/llama-4-scout · 30.85
113 · Mimo V2 Flash (thinking) · zenmux/mimo-v2-flash:thinking · 28.75
114 · Nova Lite V1 · amazon/nova-lite-v1 · 25.24
115 · Nova Micro V1 · amazon/nova-micro-v1 · 24.88
116 · Gemini Flash 1.5 8B · google/gemini-flash-1.5-8b · 24.38
117 · Qwen3 32B · qwen/qwen3-32b · 23.86
118 · Gemma 3 12B · google/gemma-3-12b-it · 21.56
119 · GPT 4o Mini · openai/gpt-4o-mini · 21.21
120 · Qwen3 30B A3B 2507 · qwen/qwen3-30b-a3b-instruct-2507 · 18.98
121 · Mistral Small 3.2 24B · mistralai/mistral-small-3.2-24b-instruct · 17.77
122 · Qwen3 14B · qwen/qwen3-14b · 16.68
123 · Qwen3 8B · Qwen/Qwen3-8B-FP8 · 15.50
124 · GPT 4.1 Nano · openai/gpt-4.1-nano · 14.95
125 · Devstral Small · mistralai/devstral-small · 13.22
126 · Codestral 2508 · mistralai/codestral-2508 · 13.15
127 · Ministral 14B 2512 · mistralai/ministral-14b-2512 · 12.29
128 · Ministral 8B 2512 · mistralai/ministral-8b-2512 · 11.78
129 · Mistral Nemo · mistralai/mistral-nemo · 11.47
130 · GPT 5.4 Nano · openai/gpt-5.4-nano:thinking-none · 11.16
131 · Gemma 3 4B · google/gemma-3-4b-it · 9.41
132 · Qwen3 0.6B (16k) · Qwen/Qwen3-0.6B-FP8:thinking-16k · 9.24
133 · Qwen3 4B · Qwen/Qwen3-4B-FP8 · 7.90
134 · Ministral 3B 2512 · mistralai/ministral-3b-2512 · 6.64
135 · Qwen3 1.7B · Qwen/Qwen3-1.7B-FP8 · 6.27
136 · Llama 3.1 8B · meta-llama/llama-3.1-8b-instruct · 3.92
137 · Llama 3.2 3B · meta-llama/llama-3.2-3b-instruct · 2.85
138 · Llama 3.2 1B · meta-llama/llama-3.2-1b-instruct · 0.63
139 · Qwen3 0.6B · Qwen/Qwen3-0.6B-FP8 · 0.06

FAQ

What does LisanBench measure?

LisanBench is a word-chain reasoning benchmark. It tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.

Which model leads the published LisanBench snapshot?

Claude Opus 4.7 currently leads the published LisanBench snapshot with a difficulty-weighted score of 5122.60. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on LisanBench?

139 AI models are included in BenchLM's mirrored LisanBench snapshot, based on the public leaderboard as captured in the June 2, 2026 snapshot.

Last updated: June 2, 2026 snapshot · mirrored from the public benchmark leaderboard
