Benchmark profile

Critical Physics Tasks (CritPt)

A display-only Artificial Analysis metric for research-level physics reasoning.

Data verified July 23, 2026

Benchmark score on CritPt — July 23, 2026

BenchLM mirrors the published score view for CritPt. GPT-5.6 Sol leads the public snapshot at 32.3%, followed by GPT-5.5 Pro (30.6%) and GPT-5.6 Terra (30.0%). BenchLM does not use these results to rank models overall.

1 · Closed

GPT-5.6 Sol

OpenAI

gpt-5-6-sol

32.3%

Overall 81.96 · Context 1M

2 · Closed

GPT-5.5 Pro

OpenAI

gpt-5-5-pro

30.6%

Overall 63.69 · Context 1M

3 · Closed

GPT-5.6 Terra

OpenAI

gpt-5-6-terra

30.0%

Overall 72.57 · Context 1M

157 models · Reasoning · Current · Display only · Updated July 23, 2026

Benchmark score table (157 models)

Rank · Model · Provider · Availability · Score

1 · GPT-5.6 Sol · OpenAI · Closed · 32.3%
2 · GPT-5.5 Pro · OpenAI · Closed · 30.6%
3 · GPT-5.6 Terra · OpenAI · Closed · 30.0%
4 · GPT-5.4 Pro · OpenAI · Closed · 30.0%
5 · Claude Fable 5 · Anthropic · Closed · 28.6%
6 · GPT-5.5 · OpenAI · Closed · 27.1%
7 · Gemini 3 Pro Deep Think · Google · Closed · 25.7%
8 · Kimi K3 · Moonshot AI · Closed · 23.4%
9 · GPT-5.4 · OpenAI · Closed · 23.4%
10 · Claude Opus 4.8 · Anthropic · Closed · 20.9%
11 · GLM-5.2 · Z.AI · Open weight · 20.9%
12 · GPT-5.6 Luna · OpenAI · Closed · 20.6%
13 · Gemini 3.1 Pro · Google · Closed · 17.7%
14 · Claude Sonnet 5 · Anthropic · Closed · 16.9%
15 · GPT-5.3 Codex · OpenAI · Closed · 16.9%
16 · GPT-5.3-Codex-Spark · OpenAI · Closed · 16.9%
17 · Grok 4.5 · xAI · Closed · 15.4%
18 · Muse Spark 1.1 · Meta · Closed · 15.1%
19 · Qwen3.7 Max · Alibaba · Closed · 13.4%
20 · Gemini 3.5 Flash · Google · Closed · 13.1%
21 · DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 12.9%
22 · Claude Opus 4.6 (Adaptive) · Anthropic · Closed · 12.6%
23 · Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 12.0%
24 · GPT-5.2 · OpenAI · Closed · 11.6%
25 · Muse Spark · Meta · Closed · 11.3%
26 · Gemini 3.6 Flash · Google · Closed · 10.6%
27 · DeepSeek V4 Pro (High) · DeepSeek · Open weight · 10.0%
28 · GPT-5.4 mini · OpenAI · Closed · 10.0%
29 · Kimi K2.7 Code · Moonshot AI · Open weight · 10.0%
30 · GPT-5.4 nano · OpenAI · Closed · 9.3%
31 · Qwen3.7 Plus · Alibaba · Closed · 9.1%
32 · Gemini 3 Pro · Google · Closed · 9.1%
33 · GPT-5.2-Codex · OpenAI · Closed · 8.7%
34 · Kimi K2.6 · Moonshot AI · Open weight · 8.0%
35 · Grok 4.3 · xAI · Closed · 8.0%
36 · DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 7.1%
37 · GPT-5 (high) · OpenAI · Closed · 5.7%
38 · GPT-5.1-Codex-Max · OpenAI · Closed · 5.7%
39 · GPT-5.1-Codex · OpenAI · Closed · 5.7%
40 · Inkling · Thinking Machines Lab · Open weight · 5.4%
41 · Claude Opus 4.7 · Anthropic · Closed · 5.1%
42 · GPT-5.1 · OpenAI · Closed · 4.9%
43 · Hy3 Preview · Tencent · Open weight · 4.9%
44 · Hy3 · Tencent · Open weight · 4.9%
45 · GLM-5.1 · Z.AI · Open weight · 4.6%
46 · Claude Opus 4.5 Thinking · Anthropic · Closed · 4.6%
47 · MiMo-V2.5-Pro · Xiaomi · Closed · 4.0%
48 · MiniMax M3 · MiniMax · Open weight · 3.7%
49 · Qwen 3.6 Max (preview) · Alibaba · Closed · 3.7%
50 · DeepSeek V4 Flash (High) · DeepSeek · Open weight · 3.4%
51 · Kimi K2.5 · Moonshot AI · Open weight · 3.1%
52 · Nemotron 3 Ultra · NVIDIA · Open weight · 3.1%
53 · Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 3.1%
54 · Nemotron 3 Super 120B A12B · NVIDIA · Open weight · 3.1%
55 · Qwen3.6 Plus · Alibaba · Closed · 2.9%
56 · Grok 4.1 Fast (Reasoning) · xAI · Closed · 2.9%
57 · Grok 4 Fast (Reasoning) · xAI · Closed · 2.9%
58 · Claude Opus 4.6 · Anthropic · Closed · 2.8%
59 · Gemini 2.5 Pro · Google · Closed · 2.6%
60 · Step 3.7 Flash · StepFun · Open weight · 2.3%
61 · Step 3.5 Flash · StepFun · Open weight · 2.3%
62 · GLM-5 · Z.AI · Open weight · 2.0%
63 · Grok 4 · xAI · Closed · 2.0%
64 · DeepSeek V3.1 (Reasoning) · DeepSeek · Open weight · 2.0%
65 · Qwen3.5 397B · Alibaba · Open weight · 1.7%
66 · GLM-4.7 · Z.AI · Open weight · 1.7%
67 · Qwen3.5 397B (Reasoning) · Alibaba · Open weight · 1.7%
68 · Gemini 3 Flash · Google · Closed · 1.4%
69 · Gemma 4 31B · Google · Open weight · 1.4%
70 · Gemini 2.5 Flash · Google · Closed · 1.4%
71 · DeepSeek-R1 · DeepSeek · Open weight · 1.4%
72 · GPT-OSS 20B · OpenAI · Open weight · 1.4%
73 · Qwen3.6-27B · Alibaba · Open weight · 1.1%
74 · Claude 4 Sonnet · Anthropic · Closed · 1.1%
75 · Gemini 3.1 Flash-Lite · Google · Closed · 1.1%
76 · o3 · OpenAI · Closed · 1.1%
77 · MiMo-V2-Omni · Xiaomi · Closed · 1.1%
78 · GPT-OSS 120B · OpenAI · Open weight · 1.1%
79 · K-Exaone · LG AI Research · Closed · 1.1%
80 · MiniMax M2.5 · MiniMax · Closed · 1.1%
81 · Claude Sonnet 4.6 · Anthropic · Closed · 0.9%
82 · Qwen3.5-27B · Alibaba · Open weight · 0.9%
83 · Qwen3.5-35B-A3B · Alibaba · Open weight · 0.9%
84 · DeepSeek V3.2 · DeepSeek · Open weight · 0.9%
85 · Trinity-Large-Preview · Arcee AI · Open weight · 0.9%
86 · Trinity-Large-Thinking · Arcee AI · Open weight · 0.9%
87 · Nemotron 3 Nano 30B · NVIDIA · Open weight · 0.9%
88 · Mercury 2 · Inception · Closed · 0.8%
89 · Qwen3.5-122B-A10B · Alibaba · Open weight · 0.6%
90 · MiniMax M2.7 · MiniMax · Open weight · 0.6%
91 · GLM-5V-Turbo · Z.AI · Closed · 0.6%
92 · Gemma 4 E4B · Google · Open weight · 0.6%
93 · Claude Opus 4.5 · Anthropic · Closed · 0.3%
94 · Qwen3.6-35B-A3B · Alibaba · Open weight · 0.3%
95 · o1 · OpenAI · Closed · 0.3%
96 · Command A+ · Cohere · Open weight · 0.3%
97 · GLM-5-Turbo · Z.AI · Closed · 0.3%
98 · MiMo-V2-Pro · Xiaomi · Closed · 0.3%
99 · Mistral Small 4 · Mistral · Open weight · 0.3%
100 · Sarvam 30B · Sarvam · Open weight · 0.3%
101 · Mistral Small 4 (Reasoning) · Mistral · Open weight · 0.3%
102 · GLM-4.7-Flash · Z.AI · Open weight · 0.3%
103 · Exaone 4.0 32B · LG AI Research · Open weight · 0.0%
104 · Mistral Medium 3.5 128B · Mistral · Open weight · 0.0%
105 · MiMo-V2-Flash · Xiaomi · Open weight · 0.0%
106 · Nemotron 3 Nano Omni 30B A3B · NVIDIA · Open weight · 0.0%
107 · Gemini 3.5 Flash-Lite · Google · Closed · 0.0%
108 · LFM2.5-8B-A1B · LiquidAI · Open weight · 0.0%
109 · Kimi K2 · Moonshot AI · Closed · 0.0%
110 · Gemma 4 26B A4B · Google · Open weight · 0.0%
111 · GPT-4.1 nano · OpenAI · Closed · 0.0%
112 · GPT-4.1 · OpenAI · Closed · 0.0%
113 · GPT-4.1 mini · OpenAI · Closed · 0.0%
114 · Gemma 4 12B · Google · Open weight · 0.0%
115 · GLM-4.6 · Z.AI · Open weight · 0.0%
116 · DeepSeek V3 · DeepSeek · Open weight · 0.0%
117 · GPT-4o · OpenAI · Closed · 0.0%
118 · Llama 4 Scout · Meta · Open weight · 0.0%
119 · Llama 4 Maverick · Meta · Open weight · 0.0%
120 · Ling 2.6 Flash · InclusionAI · Open weight · 0.0%
121 · Grok 4.1 Fast · xAI · Closed · 0.0%
122 · DeepSeek V3.1 · DeepSeek · Open weight · 0.0%
123 · Mistral Large 3 · Mistral · Closed · 0.0%
124 · GLM-4.5-Air · Z.AI · Closed · 0.0%
125 · Gemma 3 27B · Google · Open weight · 0.0%
126 · GPT-5 (medium) · OpenAI · Closed · 0.0%
127 · Mistral Large 2 · Mistral · Closed · 0.0%
128 · Llama 3.1 405B · Meta · Open weight · 0.0%
129 · Phi-4 · Microsoft · Open weight · 0.0%
130 · Grok Code Fast 1 · xAI · Closed · 0.0%
131 · Nemotron Ultra 253B · NVIDIA · Open weight · 0.0%
132 · Claude 3 Haiku · Anthropic · Closed · 0.0%
133 · Claude 4.1 Opus Thinking · Anthropic · Closed · 0.0%
134 · Nova Pro · Amazon · Closed · 0.0%
135 · LFM2.5-VL-1.6B-Extract · LiquidAI · Open weight · 0.0%
136 · Qwen3 Max · Alibaba · Closed · 0.0%
137 · Gemma 4 E2B · Google · Open weight · 0.0%
138 · Sarvam 105B · Sarvam · Open weight · 0.0%
139 · Mistral Medium 3 · Mistral · Closed · 0.0%
140 · Granite-4.0-1B · IBM · Open weight · 0.0%
141 · Granite-4.0-350M · IBM · Open weight · 0.0%
142 · Granite-4.0-H-1B · IBM · Open weight · 0.0%
143 · Granite-4.0-H-350M · IBM · Open weight · 0.0%
144 · Exaone 4.0 1.2B · LG AI Research · Open weight · 0.0%
145 · Solar Pro 2 · Upstage · Closed · 0.0%
146 · GPT-5 mini · OpenAI · Closed · 0.0%
147 · Ministral 3 14B (Reasoning) · Mistral · Open weight · 0.0%
148 · Ministral 3 14B · Mistral · Open weight · 0.0%
149 · GPT-5 nano · OpenAI · Closed · 0.0%
150 · MiniMax M1 80k · MiniMax · Closed · 0.0%
151 · LFM2-24B-A2B · LiquidAI · Closed · 0.0%
152 · Ministral 3 8B (Reasoning) · Mistral · Open weight · 0.0%
153 · LFM2.5-1.2B-Thinking · LiquidAI · Closed · 0.0%
154 · Ministral 3 8B · Mistral · Open weight · 0.0%
155 · Ministral 3 3B (Reasoning) · Mistral · Open weight · 0.0%
156 · LFM2.5-1.2B-Instruct · LiquidAI · Closed · 0.0%
157 · Ministral 3 3B · Mistral · Open weight · 0.0%

The published CritPt snapshot places GPT-5.6 Sol first at 32.3%. Third-place GPT-5.6 Terra (30.0%) trails by 2.3 points, and the top 10 span 11.4 points (32.3% down to 20.9%), so scores at the top of the table are still spread out rather than saturated.
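The gap arithmetic in the paragraph above can be checked with a few lines of Python. This is a minimal sketch: the scores are copied from the top ten rows of the score table, while the `top10` list and the `gap` helper are illustrative names chosen here, not part of any BenchLM or Artificial Analysis tooling.

```python
# Top-10 CritPt scores (percent), copied from the score table on this page.
# The list name and the helper below are illustrative assumptions.
top10 = [
    ("GPT-5.6 Sol", 32.3),
    ("GPT-5.5 Pro", 30.6),
    ("GPT-5.6 Terra", 30.0),
    ("GPT-5.4 Pro", 30.0),
    ("Claude Fable 5", 28.6),
    ("GPT-5.5", 27.1),
    ("Gemini 3 Pro Deep Think", 25.7),
    ("Kimi K3", 23.4),
    ("GPT-5.4", 23.4),
    ("Claude Opus 4.8", 20.9),
]


def gap(rows, i, j):
    """Score difference in percentage points between rows i and j (0-based)."""
    # Round to one decimal to match the precision of the published scores.
    return round(rows[i][1] - rows[j][1], 1)


lead_over_third = gap(top10, 0, 2)  # first place vs. third row
top10_range = gap(top10, 0, 9)      # first place vs. tenth row

print(f"Lead over third row: {lead_over_third} points")
print(f"Top-10 range: {top10_range} points")
```

The two printed values correspond to the 2.3-point and 11.4-point figures quoted in the summary.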

157 models have been evaluated on CritPt. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. CritPt itself is displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.

About CritPt

Year: 2026

Tasks: Research-level physics questions

Format: Accuracy

Difficulty: Research-level physics reasoning

BenchLM stores CritPt as a display-only research-physics reasoning row from Artificial Analysis' independently evaluated leaderboard.

CritPt Benchmark Leaderboard

BenchLM freshness & provenance

Version: CritPt 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does CritPt measure?

CritPt measures research-level physics reasoning. BenchLM shows it as a display-only metric sourced from Artificial Analysis.

Which model scores highest on CritPt?

GPT-5.6 Sol by OpenAI currently leads with a score of 32.3% on CritPt.

How many models are evaluated on CritPt?

BenchLM lists CritPt scores for 157 AI models.

Compare Top Models on CritPt

GPT-5.6 Sol vs GPT-5.5 Pro
GPT-5.5 Pro vs GPT-5.6 Terra
GPT-5.6 Terra vs GPT-5.4 Pro
GPT-5.4 Pro vs Claude Fable 5

Last updated: July 23, 2026 · BenchLM version CritPt 2026
