Benchmark profile

CyberGym

A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.

Data verified July 23, 2026

Benchmark score on CyberGym — July 23, 2026

BenchLM mirrors the published score view for CyberGym. Fugu Cyber leads the public snapshot at 86.9%, followed by GPT-5.6 Sol (84.5%) and GPT-5.6 Terra (81.8%). BenchLM does not use these results to rank models overall.

1 · Closed

Fugu Cyber

Sakana AI

sakana-fugu-cyber

86.9%

Overall: n/a · Context: 1M

2 · Closed

GPT-5.6 Sol

OpenAI

gpt-5-6-sol

84.5%

Overall: 81.96 · Context: 1M

3 · Closed

GPT-5.6 Terra

OpenAI

gpt-5-6-terra

81.8%

Overall: 72.57 · Context: 1M

14 models · Agentic · Current · Display only · Updated July 23, 2026

Benchmark score table (14 models)

Model                       Provider    License      Score
Fugu Cyber                  Sakana AI   Closed       86.9%
GPT-5.6 Sol                 OpenAI      Closed       84.5%
GPT-5.6 Terra               OpenAI      Closed       81.8%
GPT-5.5                     OpenAI      Closed       81.8%
GPT-5.4                     OpenAI      Closed       79.0%
GPT-5.6 Luna                OpenAI      Closed       77.9%
Claude Opus 4.7 (Adaptive)  Anthropic   Closed       73.1%
GLM-5.1                     Z.AI        Open weight  68.7%
Claude Opus 4.6             Anthropic   Closed       66.6%
Claude Sonnet 4.6           Anthropic   Closed       65.2%
Muse Spark 1.1              Meta        Closed       59.0%
Claude Opus 4.5             Anthropic   Closed       50.6%
Muse Spark                  Meta        Closed       43.5%
GLM-5                       Z.AI        Open weight  43.2%

The published CyberGym snapshot places Fugu Cyber first at 86.9%. Third-place GPT-5.6 Terra (81.8%, tied with GPT-5.5) trails the leader by 5.1 points. The top 10 span 21.7 points, from 86.9% down to 65.2%, so the table still separates the published systems.
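The gaps quoted above can be rechecked directly from the score table. The following is a minimal Python sketch, not BenchLM's own tooling: the `scores` dictionary is copied by hand from the table, and the helper name `gap` is hypothetical.

```python
# Scores (percent) copied from the CyberGym score table, in table order.
scores = {
    "Fugu Cyber": 86.9,
    "GPT-5.6 Sol": 84.5,
    "GPT-5.6 Terra": 81.8,
    "GPT-5.5": 81.8,
    "GPT-5.4": 79.0,
    "GPT-5.6 Luna": 77.9,
    "Claude Opus 4.7 (Adaptive)": 73.1,
    "GLM-5.1": 68.7,
    "Claude Opus 4.6": 66.6,
    "Claude Sonnet 4.6": 65.2,
    "Muse Spark 1.1": 59.0,
    "Claude Opus 4.5": 50.6,
    "Muse Spark": 43.5,
    "GLM-5": 43.2,
}

# Rank by score, highest first. Python's sort is stable, so tied models
# (GPT-5.6 Terra and GPT-5.5) keep their table order.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)


def gap(ranked_scores, higher_rank, lower_rank):
    """Point difference between two 1-based ranks, rounded to one decimal."""
    higher = ranked_scores[higher_rank - 1][1]
    lower = ranked_scores[lower_rank - 1][1]
    return round(higher - lower, 1)


lead_over_third = gap(ranked, 1, 3)   # leader vs. third row
top10_range = gap(ranked, 1, 10)      # leader vs. tenth row

print(f"Leader: {ranked[0][0]} at {ranked[0][1]}%")
print(f"Gap to third row: {lead_over_third} points")
print(f"Top-10 range: {top10_range} points")
```

Rounding to one decimal matches the precision of the published scores and avoids floating-point noise in the differences.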

14 models have been evaluated on CyberGym. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. CyberGym is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About CyberGym

Year: 2026

Tasks: 1,507 vulnerability analysis instances

Format: Vulnerability reproduction and PoC generation

Difficulty: Real-world cybersecurity

CyberGym includes 1,507 benchmark instances from historical vulnerabilities across 188 large software projects. BenchLM stores CyberGym as a display-only agentic security benchmark when exact provider comparison values are published.

CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

BenchLM freshness & provenance

Version: CyberGym 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does CyberGym measure?

A cybersecurity task benchmark for evaluating defensive cyber workflows and vulnerability-oriented agent performance.

Which model scores highest on CyberGym?

Fugu Cyber by Sakana AI currently leads with a score of 86.9% on CyberGym.

How many models are evaluated on CyberGym?

BenchLM lists 14 AI models evaluated on CyberGym.

Compare Top Models on CyberGym

Fugu Cyber vs GPT-5.6 Sol · GPT-5.6 Sol vs GPT-5.6 Terra · GPT-5.6 Terra vs GPT-5.5 · GPT-5.5 vs GPT-5.4

Last updated: July 23, 2026 · BenchLM version CyberGym 2026
