A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.
BenchLM mirrors the published score view for CyberGym. Claude Mythos Preview leads the public snapshot at 83.1%, followed by GPT-5.5 (81.8%) and GPT-5.4 (79.0%). BenchLM does not use these results to rank models overall.
Claude Mythos Preview (Anthropic)
GPT-5.5 (OpenAI)
GPT-5.4 (OpenAI)
The published CyberGym snapshot is tightly clustered at the top: Claude Mythos Preview sits at 83.1%, while the third row is only 4.1 points behind. The broader top-10 spread is 39.9 points, so the benchmark still separates strong models even when the leaders cluster.
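The gap figures above can be checked with a few lines of arithmetic. The sketch below uses only the three scores quoted on this page. The helper name `gap_to_leader` is illustrative, and the implied tenth-place score assumes that "top-10 spread" means the first-place score minus the tenth-place score, which the page does not state explicitly.

```python
# Scores for the top three rows, as quoted in the published snapshot.
top_three = {
    "Claude Mythos Preview": 83.1,
    "GPT-5.5": 81.8,
    "GPT-5.4": 79.0,
}

def gap_to_leader(scores: dict[str, float]) -> dict[str, float]:
    """Return each model's distance from the best score, in percentage points."""
    best = max(scores.values())
    return {name: round(best - value, 1) for name, value in scores.items()}

gaps = gap_to_leader(top_three)
print(gaps["GPT-5.4"])  # gap between the first and third rows

# Assumption: spread = first-place score minus tenth-place score.
TOP_10_SPREAD = 39.9
implied_tenth = round(max(top_three.values()) - TOP_10_SPREAD, 1)
print(implied_tenth)
```

Under that assumption, the tenth row would sit in the low 40s, which is consistent with the claim that the benchmark still separates models below the leading cluster.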
Ten models have been evaluated on CyberGym, which falls in the Agentic category. That category carries a 22% weight in BenchLM.ai's overall scoring system. CyberGym itself is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
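To make the "displayed but excluded" distinction concrete, here is a minimal sketch of how a display-only benchmark could be filtered out of a weighted overall score. This is not BenchLM's actual formula. Only the 22% Agentic weight comes from this page. The `Benchmark` class, the other benchmark names, the "Other" category, and its 78% weight are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    category: str
    score: float        # percentage, 0-100
    display_only: bool  # True: shown on the page but left out of scoring

def category_score(benchmarks, category):
    """Average the scored (non display-only) benchmarks in one category."""
    scored = [b.score for b in benchmarks
              if b.category == category and not b.display_only]
    return sum(scored) / len(scored) if scored else None

def overall(benchmarks, weights):
    """Weighted mean over categories that have at least one scored benchmark."""
    total, weight_sum = 0.0, 0.0
    for category, weight in weights.items():
        value = category_score(benchmarks, category)
        if value is None:
            continue  # nothing scored in this category
        total += weight * value
        weight_sum += weight
    return total / weight_sum if weight_sum else None

# Hypothetical weights: only the Agentic figure is taken from the page.
weights = {"Agentic": 0.22, "Other": 0.78}

scored_only = [
    Benchmark("ExampleAgentBench", "Agentic", 70.0, display_only=False),  # hypothetical
    Benchmark("ExampleOtherBench", "Other", 60.0, display_only=False),    # hypothetical
]
with_cybergym = scored_only + [
    Benchmark("CyberGym", "Agentic", 83.1, display_only=True),
]

# The display-only entry leaves the overall score unchanged.
print(overall(scored_only, weights) == overall(with_cybergym, weights))
```

In this sketch the filter runs before the category average is taken, so a display-only result cannot shift the category score or the overall ranking however high or low it is.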
Year: 2026
Tasks: 1,507 vulnerability analysis instances
Format: Vulnerability reproduction and PoC generation
Difficulty: Real-world cybersecurity
CyberGym includes 1,507 benchmark instances from historical vulnerabilities across 188 large software projects. BenchLM stores CyberGym as a display-only agentic security benchmark when exact provider comparison values are published.
Version: CyberGym 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.
Claude Mythos Preview by Anthropic currently leads with a score of 83.1% on CyberGym.
10 AI models have been evaluated on CyberGym on BenchLM.