A multimodal variant of SWE-bench that adds visual context such as screenshots and design mockups to software engineering issue descriptions.
BenchLM mirrors the published score view for SWE Multimodal. Claude Opus 4.8 leads the public snapshot at 38.4%. BenchLM does not use these results to rank models overall.
Year: 2025
Tasks: Multimodal software engineering tasks
Format: Code patch generation with visual context
Difficulty: Frontier multimodal coding
BenchLM stores provider-reported SWE-bench Multimodal values in the coding category when the model vendor reports the benchmark as part of a software-engineering capability suite.
Version: SWE Multimodal 2025
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
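As a rough illustration of how freshness metadata could feed such a decision, the sketch below models the four metadata fields listed on this page and looks up a role from the staleness state. It does not reproduce BenchLM's actual scoring policy: the names `Freshness`, `Role`, `classify`, and `EXAMPLE_POLICY`, the "aging" and "stale" states, and the mapping itself are all hypothetical, and the real policy may weigh other inputs.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """The three treatments named in the text above."""
    STRONG_DIFFERENTIATOR = "strong differentiator"
    WATCH = "benchmark to watch"
    DISPLAY_ONLY = "display-only reference"


@dataclass(frozen=True)
class Freshness:
    """Freshness metadata fields as they appear on this page."""
    version: str
    refresh_cadence: str
    staleness_state: str
    question_availability: str


# Hypothetical mapping, for illustration only. Only the "current" state
# appears on this page; "aging" and "stale" are assumed names.
EXAMPLE_POLICY = {
    "current": Role.STRONG_DIFFERENTIATOR,
    "aging": Role.WATCH,
    "stale": Role.DISPLAY_ONLY,
}


def classify(meta: Freshness, policy: dict, default: Role = Role.DISPLAY_ONLY) -> Role:
    """Look up a role from the staleness state.

    States missing from the policy fall back to the most conservative role.
    """
    return policy.get(meta.staleness_state.strip().lower(), default)


if __name__ == "__main__":
    page_meta = Freshness(
        version="SWE Multimodal 2025",
        refresh_cadence="Quarterly",
        staleness_state="Current",
        question_availability="Public benchmark set",
    )
    # The result reflects EXAMPLE_POLICY only, not BenchLM's real rules.
    print(classify(page_meta, EXAMPLE_POLICY).value)
```

Passing the policy in as an argument keeps the assumed mapping separate from the lookup, so the table can be swapped for the real rules without touching `classify`.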
Claude Opus 4.8 by Anthropic currently leads with a score of 38.4% on SWE Multimodal.
1 AI model has been evaluated on SWE Multimodal on BenchLM.