A word-chain reasoning benchmark that tests planning, recall, constraint following, and vocabulary depth by asking models to extend non-repeating edit-distance-1 chains.
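The chain rule described above can be illustrated with a minimal sketch. It assumes that "edit-distance-1" means Levenshtein distance (one insertion, deletion, or substitution) and that words must come from a fixed vocabulary. The function names, the `vocabulary` parameter, and the stop-at-first-violation counting are illustrative assumptions, not the official LisanBench scorer.

```python
def is_one_edit_apart(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    # Make `a` the shorter (or equal-length) word.
    if len(a) > len(b):
        a, b = b, a
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Substitution: everything after the mismatch must match.
        return a[i + 1:] == b[i + 1:]
    # Insertion/deletion: drop one character from the longer word.
    return a[i:] == b[i + 1:]


def valid_chain_length(chain, vocabulary):
    """Count words accepted before the first rule violation (hypothetical scoring)."""
    seen = set()
    prev = None
    count = 0
    for word in chain:
        # A repeated word or a word outside the vocabulary ends the chain.
        if word in seen or word not in vocabulary:
            break
        # Each new word must be one edit away from the previous one.
        if prev is not None and not is_one_edit_apart(prev, word):
            break
        seen.add(word)
        prev = word
        count += 1
    return count


vocab = {"cold", "cord", "card", "ward"}
length = valid_chain_length(["cold", "cord", "card", "ward", "cord"], vocab)
```

Here the chain stops at the fifth word because "cord" was already used, so `length` is 4. Tracking used words in a set keeps the non-repetition check constant-time per word.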
BenchLM mirrors the current public LisanBench difficulty-weighted leaderboard using the official dataset published at lisanbench.com for the June 2, 2026 snapshot. The public benchmark tests 139 model variants across 50 starting words, with 3 trials per starting word.
LisanBench is a strong reasoning reference, but BenchLM currently keeps it display-only rather than weighted. The public leaderboard is highly variant-specific, strongly dependent on English vocabulary, and not yet aligned cleanly enough with BenchLM's canonical model rows to use as a ranking input.
BenchLM mirrors the published difficulty-weighted score view for LisanBench. Claude Opus 4.7 leads the public snapshot at 5122.60, followed by Opus 4.6 (16k) (3526.49) and GPT-5.5 (3315.52). BenchLM does not use these results to rank models overall.
Claude Opus 4.7 (Anthropic): anthropic/claude-opus-4.7:thinking-xhigh
Opus 4.6 (16k) (Anthropic): anthropic/claude-opus-4.6:thinking-16k
GPT-5.5 (OpenAI): openai/gpt-5.5:thinking-medium
In the published LisanBench snapshot, Claude Opus 4.7 leads clearly at 5122.60: the second row trails by 1596.11 points and the third row by 1807.08. The broader top-10 spread is 3599.60 points, so the benchmark separates strong models across the top of the table.
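The gap to the third row can be reproduced from the three scores quoted on this page. The sketch below is plain arithmetic on those published numbers; the variable names are illustrative.

```python
# Difficulty-weighted scores quoted on this page for the top three rows.
scores = {
    "Claude Opus 4.7": 5122.60,
    "Opus 4.6 (16k)": 3526.49,
    "GPT-5.5": 3315.52,
}

leader_score = max(scores.values())

# Absolute gap between the leader and each row, rounded to two decimals.
gaps = {name: round(leader_score - score, 2) for name, score in scores.items()}

# Gap to the third row as a share of the leader's score.
relative_gap_third = gaps["GPT-5.5"] / leader_score

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} points behind the leader")
```

Expressing the gap as a share of the leader's score makes it easier to judge how close the top rows really are than the raw point difference alone.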
139 models have been evaluated on LisanBench. The benchmark falls in the Reasoning category. This category carries a 17% weight in BenchLM.ai's overall scoring system. LisanBench is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
Year: 2026
Tasks: 50 starting words × 3 trials
Format: Difficulty-weighted word-chain reasoning
Difficulty: Open-ended lexical planning
BenchLM mirrors the public difficulty-weighted LisanBench leaderboard as a display-only reasoning benchmark. The public benchmark currently evaluates 139 model variants across 50 starting words with 3 trials per word.
Version: LisanBench 2026
Refresh cadence: Static
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Opus 4.7 currently leads the published LisanBench snapshot with a difficulty-weighted score of 5122.60. BenchLM shows this benchmark for display only and does not use it in overall rankings.
139 AI models are included in BenchLM's mirrored LisanBench snapshot, based on the public leaderboard captured on June 2, 2026.