Benchmark profile

Artificial Analysis SciCode (AA-SciCode)

A display-only Artificial Analysis SciCode score.

Data verified July 23, 2026

Benchmark score on AA-SciCode — July 23, 2026

BenchLM mirrors the published score view for AA-SciCode. Claude Fable 5 leads the public snapshot at 60.2%, followed by Gemini 3.1 Pro (58.9%) and Kimi K3 (58.7%). BenchLM does not use these results to rank models overall.

1 · Claude Fable 5 · Anthropic · Closed · claude-fable-5 · 60.2% · Overall 83.68 · Context 1M+
2 · Gemini 3.1 Pro · Google · Closed · gemini-3-1-pro · 58.9% · Overall 55.3 · Context 1M
3 · Kimi K3 · Moonshot AI · Closed · kimi-3 · 58.7% · Overall 80.96 · Context 1.05M

162 models · Coding · Current · Display only · Updated July 23, 2026

Benchmark score table (162 models)

Rank · Model · Provider · Access · Score

1 · Claude Fable 5 · Anthropic · Closed · 60.2%
2 · Gemini 3.1 Pro · Google · Closed · 58.9%
3 · Kimi K3 · Moonshot AI · Closed · 58.7%
4 · Muse Spark 1.1 · Meta · Closed · 58.2%
5 · GPT-5.4 · OpenAI · Closed · 56.6%
6 · GPT-5.6 Sol · OpenAI · Closed · 56.1%
7 · GPT-5.5 · OpenAI · Closed · 56.1%
8 · Gemini 3 Pro · Google · Closed · 56.1%
9 · GPT-5.2-Codex · OpenAI · Closed · 54.6%
10 · Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 54.5%
11 · Grok 4.5 · xAI · Closed · 54.1%
12 · GPT-5.6 Terra · OpenAI · Closed · 53.9%
13 · Claude Sonnet 5 · Anthropic · Closed · 53.6%
14 · Claude Opus 4.8 · Anthropic · Closed · 53.5%
15 · Kimi K2.6 · Moonshot AI · Open weight · 53.5%
16 · GPT-5.3 Codex · OpenAI · Closed · 53.2%
17 · GPT-5.3-Codex-Spark · OpenAI · Closed · 53.2%
18 · Gemini 3.5 Flash · Google · Closed · 53.1%
19 · Gemini 3.6 Flash · Google · Closed · 52.7%
20 · GPT-5.6 Luna · OpenAI · Closed · 52.5%
21 · GPT-5.2 · OpenAI · Closed · 52.1%
22 · Claude Opus 4.6 (Adaptive) · Anthropic · Closed · 51.9%
23 · Muse Spark · Meta · Closed · 51.5%
24 · GLM-5.2 · Z.AI · Open weight · 50.5%
25 · MiMo-V2.5-Pro · Xiaomi · Closed · 50.2%
26 · Claude Opus 4.7 · Anthropic · Closed · 50.1%
27 · DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 50.0%
28 · GPT-5.4 mini · OpenAI · Closed · 49.9%
29 · Gemini 3 Flash · Google · Closed · 49.9%
30 · Claude Opus 4.5 Thinking · Anthropic · Closed · 49.5%
31 · Kimi K2.5 · Moonshot AI · Open weight · 49.0%
32 · Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 49.0%
33 · Qwen3.7 Max · Alibaba · Closed · 48.8%
34 · Hy3 Preview · Tencent · Open weight · 47.6%
35 · Hy3 · Tencent · Open weight · 47.6%
36 · Kimi K2.7 Code · Moonshot AI · Open weight · 47.5%
37 · Grok 4.3 · xAI · Closed · 47.3%
38 · Claude Opus 4.5 · Anthropic · Closed · 47.0%
39 · MiniMax M2.7 · MiniMax · Open weight · 47.0%
40 · Claude Sonnet 4.6 · Anthropic · Closed · 46.9%
41 · GPT-5.4 nano · OpenAI · Closed · 46.9%
42 · Qwen 3.6 Max (preview) · Alibaba · Closed · 46.9%
43 · DeepSeek V4 Pro (High) · DeepSeek · Open weight · 46.4%
44 · GLM-5 · Z.AI · Open weight · 46.2%
45 · Inkling · Thinking Machines Lab · Open weight · 46.1%
46 · Claude Opus 4.6 · Anthropic · Closed · 45.7%
47 · Grok 4 · xAI · Closed · 45.7%
48 · Qwen3.7 Plus · Alibaba · Closed · 45.5%
49 · MiniMax M3 · MiniMax · Open weight · 45.4%
50 · GLM-4.7 · Z.AI · Open weight · 45.1%
51 · DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 44.9%
52 · Grok 4.1 Fast (Reasoning) · xAI · Closed · 44.2%
53 · Grok 4 Fast (Reasoning) · xAI · Closed · 44.2%
54 · GLM-5.1 · Z.AI · Open weight · 43.8%
55 · GLM-5-Turbo · Z.AI · Closed · 43.6%
56 · GLM-5V-Turbo · Z.AI · Closed · 43.5%
57 · Gemma 4 31B · Google · Open weight · 43.4%
58 · GPT-5.1 · OpenAI · Closed · 43.3%
59 · GPT-5 (high) · OpenAI · Closed · 42.9%
60 · Gemini 2.5 Pro · Google · Closed · 42.8%
61 · MiniMax M2.5 · MiniMax · Closed · 42.6%
62 · MiMo-V2-Pro · Xiaomi · Closed · 42.5%
63 · Qwen3.5 397B · Alibaba · Open weight · 42.0%
64 · Qwen3.5-122B-A10B · Alibaba · Open weight · 42.0%
65 · DeepSeek V4 Flash (High) · DeepSeek · Open weight · 42.0%
66 · Qwen3.5 397B (Reasoning) · Alibaba · Open weight · 42.0%
67 · Gemini 3.1 Flash-Lite · Google · Closed · 41.9%
68 · GPT-5 (medium) · OpenAI · Closed · 41.1%
69 · o3 · OpenAI · Closed · 41.0%
70 · Gemini 3.5 Flash-Lite · Google · Closed · 40.9%
71 · Claude 4.1 Opus Thinking · Anthropic · Closed · 40.9%
72 · Qwen3.6 Plus · Alibaba · Closed · 40.7%
73 · GPT-4.1 mini · OpenAI · Closed · 40.4%
74 · DeepSeek-R1 · DeepSeek · Open weight · 40.3%
75 · GPT-5.1-Codex-Max · OpenAI · Closed · 40.2%
76 · GPT-5.1-Codex · OpenAI · Closed · 40.2%
77 · Step 3.7 Flash · StepFun · Open weight · 40.0%
78 · Gemma 4 26B A4B · Google · Open weight · 40.0%
79 · Nemotron 3 Ultra · NVIDIA · Open weight · 39.9%
80 · o3-mini · OpenAI · Closed · 39.9%
81 · Qwen3.6-27B · Alibaba · Open weight · 39.8%
82 · Mistral Medium 3.5 128B · Mistral · Open weight · 39.6%
83 · Qwen3.5-27B · Alibaba · Open weight · 39.5%
84 · GPT-5 mini · OpenAI · Closed · 39.2%
85 · DeepSeek V3.1 (Reasoning) · DeepSeek · Open weight · 39.1%
86 · GPT-OSS 120B · OpenAI · Open weight · 38.9%
87 · DeepSeek V3.2 · DeepSeek · Open weight · 38.7%
88 · Mercury 2 · Inception · Closed · 38.7%
89 · Step 3.5 Flash · StepFun · Open weight · 38.5%
90 · Qwen3 Max · Alibaba · Closed · 38.3%
91 · Gemma 4 12B · Google · Open weight · 38.2%
92 · GPT-4.1 · OpenAI · Closed · 38.1%
93 · Mistral Small 4 · Mistral · Open weight · 38.0%
94 · Mistral Small 4 (Reasoning) · Mistral · Open weight · 38.0%
95 · Command A+ · Cohere · Open weight · 37.8%
96 · Qwen3.5-35B-A3B · Alibaba · Open weight · 37.7%
97 · DeepSeek R1 Distill Qwen 32B · DeepSeek · Open weight · 37.6%
98 · MiniMax M1 80k · MiniMax · Closed · 37.4%
99 · Claude 4 Sonnet · Anthropic · Closed · 37.3%
100 · MiMo-V2-Omni · Xiaomi · Closed · 36.7%
101 · DeepSeek V3.1 · DeepSeek · Open weight · 36.7%
102 · GPT-5 nano · OpenAI · Closed · 36.6%
103 · Mistral Large 3 · Mistral · Closed · 36.2%
104 · Grok Code Fast 1 · xAI · Closed · 36.2%
105 · Trinity-Large-Preview · Arcee AI · Open weight · 36.1%
106 · Trinity-Large-Thinking · Arcee AI · Open weight · 36.1%
107 · Nemotron 3 Super 120B A12B · NVIDIA · Open weight · 36.0%
108 · Qwen3.6-35B-A3B · Alibaba · Open weight · 35.8%
109 · o1 · OpenAI · Closed · 35.8%
110 · K-Exaone · LG AI Research · Closed · 35.6%
111 · DeepSeek V3 · DeepSeek · Open weight · 35.4%
112 · Nemotron Ultra 253B · NVIDIA · Open weight · 34.7%
113 · Kimi K2 · Moonshot AI · Closed · 34.5%
114 · GPT-OSS 20B · OpenAI · Open weight · 34.4%
115 · GLM-4.7-Flash · Z.AI · Open weight · 33.7%
116 · GPT-4o · OpenAI · Closed · 33.3%
117 · GLM-4.6 · Z.AI · Open weight · 33.1%
118 · Llama 4 Maverick · Meta · Open weight · 33.1%
119 · Mistral Medium 3 · Mistral · Closed · 33.1%
120 · GPT-4 Turbo · OpenAI · Closed · 31.9%
121 · GLM-4.5-Air · Z.AI · Closed · 30.6%
122 · Llama 3.1 405B · Meta · Open weight · 29.9%
123 · Grok 4.1 Fast · xAI · Closed · 29.6%
124 · Nemotron 3 Nano 30B · NVIDIA · Open weight · 29.6%
125 · Gemini 1.5 Pro · Google · Closed · 29.5%
126 · Mistral Large 2 · Mistral · Closed · 29.2%
127 · Gemini 2.5 Flash · Google · Closed · 29.1%
128 · Nemotron 3 Nano Omni 30B A3B · NVIDIA · Open weight · 27.8%
129 · Ling 2.6 Flash · InclusionAI · Open weight · 27.1%
130 · Qwen2.5 Coder 32B Instruct · Alibaba · Open weight · 27.1%
131 · Sarvam 105B · Sarvam · Open weight · 26.4%
132 · Phi-4 · Microsoft · Open weight · 26.0%
133 · MiMo-V2-Flash · Xiaomi · Open weight · 25.9%
134 · GPT-4.1 nano · OpenAI · Closed · 25.9%
135 · Exaone 4.0 32B · LG AI Research · Open weight · 25.2%
136 · Solar Pro 2 · Upstage · Closed · 24.8%
137 · Gemma 4 E4B · Google · Open weight · 24.4%
138 · Ministral 3 14B (Reasoning) · Mistral · Open weight · 23.6%
139 · Ministral 3 14B · Mistral · Open weight · 23.6%
140 · Claude 3 Opus · Anthropic · Closed · 23.3%
141 · GPT-4o mini · OpenAI · Closed · 22.9%
142 · Gemma 3 27B · Google · Open weight · 21.2%
143 · Gemma 4 E2B · Google · Open weight · 20.9%
144 · Nova Pro · Amazon · Closed · 20.8%
145 · Ministral 3 8B (Reasoning) · Mistral · Open weight · 20.8%
146 · Ministral 3 8B · Mistral · Open weight · 20.8%
147 · Sarvam 30B · Sarvam · Open weight · 19.2%
148 · Claude 3 Haiku · Anthropic · Closed · 18.6%
149 · Llama 4 Scout · Meta · Open weight · 17.0%
150 · Ministral 3 3B (Reasoning) · Mistral · Open weight · 14.4%
151 · Ministral 3 3B · Mistral · Open weight · 14.4%
152 · Gemini 1.0 Pro · Google · Closed · 11.7%
153 · LFM2-24B-A2B · LiquidAI · Closed · 10.9%
154 · Granite-4.0-1B · IBM · Open weight · 8.7%
155 · Granite-4.0-H-1B · IBM · Open weight · 8.2%
156 · LFM2.5-8B-A1B · LiquidAI · Open weight · 7.8%
157 · Exaone 4.0 1.2B · LG AI Research · Open weight · 7.4%
158 · LFM2.5-1.2B-Thinking · LiquidAI · Closed · 4.2%
159 · LFM2.5-VL-1.6B-Extract · LiquidAI · Open weight · 3.0%
160 · LFM2.5-1.2B-Instruct · LiquidAI · Closed · 2.3%
161 · Granite-4.0-H-350M · IBM · Open weight · 1.7%
162 · Granite-4.0-350M · IBM · Open weight · 0.9%

The published AA-SciCode snapshot places Claude Fable 5 first at 60.2%. Third-placed Kimi K3 (58.7%) is 1.5 points behind. The top 10 span 5.7 points, from 60.2% down to 54.5%, so the leading results sit in a relatively narrow band.
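The gaps quoted above can be recomputed from the table. The sketch below is only an illustration: the scores are copied by hand from the top 10 rows of the published table, and the variable names (`top10`, `first_to_third`, `top10_range`) are hypothetical, not part of any BenchLM or Artificial Analysis tooling.

```python
# Top-10 AA-SciCode scores (percent), copied by hand from the table above.
# This list is an illustrative snapshot, not a live data feed.
top10 = [
    ("Claude Fable 5", 60.2),
    ("Gemini 3.1 Pro", 58.9),
    ("Kimi K3", 58.7),
    ("Muse Spark 1.1", 58.2),
    ("GPT-5.4", 56.6),
    ("GPT-5.6 Sol", 56.1),
    ("GPT-5.5", 56.1),
    ("Gemini 3 Pro", 56.1),
    ("GPT-5.2-Codex", 54.6),
    ("Claude Opus 4.7 (Adaptive)", 54.5),
]

scores = [score for _, score in top10]

# Gap between first and third place, rounded to one decimal
# to avoid floating-point noise in the subtraction.
first_to_third = round(scores[0] - scores[2], 1)

# Spread between the best and the tenth-best published score.
top10_range = round(max(scores) - min(scores), 1)

print(f"1st to 3rd gap: {first_to_third} points")
print(f"Top-10 range:   {top10_range} points")
```

Rounding to one decimal matches the precision of the published scores; any finer difference would be an artifact of floating-point subtraction rather than of the benchmark.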

162 models have been evaluated on AA-SciCode. The benchmark falls in the Coding category. This category carries a 20% weight in BenchLM.ai's overall scoring system. AA-SciCode is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About AA-SciCode

Year: 2026

Tasks: Scientific coding subproblems

Format: Task success rate

Difficulty: Scientific programming

BenchLM stores the Artificial Analysis SciCode result separately from the weighted SciCode lane so AA refreshes remain display-only.

Artificial Analysis SciCode Benchmark Leaderboard

BenchLM freshness & provenance

Version: AA-SciCode 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set


BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does AA-SciCode measure?

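AA-SciCode reports the task success rate on scientific coding subproblems, as published by Artificial Analysis. BenchLM shows it as a display-only score.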

Which model scores highest on AA-SciCode?

Claude Fable 5 by Anthropic currently leads with a score of 60.2% on AA-SciCode.

How many models are evaluated on AA-SciCode?

BenchLM lists AA-SciCode results for 162 AI models.


Last updated: July 23, 2026 · BenchLM version AA-SciCode 2026
