Benchmark profile

Artificial Analysis GPQA Diamond (AA-GPQA Diamond)

A display-only Artificial Analysis GPQA Diamond score.

Data verified July 23, 2026

Benchmark score on AA-GPQA Diamond — July 23, 2026

BenchLM mirrors the published score view for AA-GPQA Diamond. GPT-5.6 Sol leads the public snapshot at 94.1%, level with Gemini 3.1 Pro (94.1%) and ahead of Kimi K3 (93.5%). BenchLM does not use these results to rank models overall.

1 · GPT-5.6 Sol · OpenAI · gpt-5-6-sol · Closed · 94.1% · Overall 81.96 · Context 1M

2 · Gemini 3.1 Pro · Google · gemini-3-1-pro · Closed · 94.1% · Overall 55.3 · Context 1M

3 · Kimi K3 · Moonshot AI · kimi-3 · Closed · 93.5% · Overall 80.96 · Context 1.05M

162 models · Knowledge · Current · Display only · Updated July 23, 2026

Benchmark score table (162 models)

Rank · Model · Provider · Access · Score
1 · GPT-5.6 Sol · OpenAI · Closed · 94.1%
2 · Gemini 3.1 Pro · Google · Closed · 94.1%
3 · Kimi K3 · Moonshot AI · Closed · 93.5%
4 · GPT-5.5 · OpenAI · Closed · 93.5%
5 · Grok 4.5 · xAI · Closed · 93.1%
6 · MiniMax M3 · MiniMax · Open weight · 92.9%
7 · Gemini 3.6 Flash · Google · Closed · 92.8%
8 · Claude Fable 5 · Anthropic · Closed · 92.6%
9 · GPT-5.6 Terra · OpenAI · Closed · 92.5%
10 · Qwen3.7 Max · Alibaba · Closed · 92.3%
11 · Gemini 3.5 Flash · Google · Closed · 92.2%
12 · Claude Opus 4.8 · Anthropic · Closed · 92.0%
13 · GPT-5.4 · OpenAI · Closed · 92.0%
14 · GPT-5.3 Codex · OpenAI · Closed · 91.5%
15 · GPT-5.3-Codex-Spark · OpenAI · Closed · 91.5%
16 · Claude Opus 4.7 (Adaptive) · Anthropic · Closed · 91.4%
17 · Claude Sonnet 5 · Anthropic · Closed · 91.1%
18 · GPT-5.6 Luna · OpenAI · Closed · 91.1%
19 · Kimi K2.6 · Moonshot AI · Open weight · 91.1%
20 · Gemini 3 Pro · Google · Closed · 90.8%
21 · DeepSeek V4 Pro (High) · DeepSeek · Open weight · 90.5%
22 · GPT-5.2 · OpenAI · Closed · 90.3%
23 · Grok 4.3 · xAI · Closed · 90.1%
24 · Qwen3.7 Plus · Alibaba · Closed · 90.0%
25 · GPT-5.2-Codex · OpenAI · Closed · 89.9%
26 · Muse Spark 1.1 · Meta · Closed · 89.8%
27 · Hy3 Preview · Tencent · Open weight · 89.7%
28 · Hy3 · Tencent · Open weight · 89.7%
29 · Claude Opus 4.6 (Adaptive) · Anthropic · Closed · 89.6%
30 · Kimi K2.7 Code · Moonshot AI · Open weight · 89.6%
31 · GLM-5.2 · Z.AI · Open weight · 89.5%
32 · DeepSeek V4 Flash (Max) · DeepSeek · Open weight · 89.4%
33 · Qwen3.5 397B · Alibaba · Open weight · 89.3%
34 · Qwen3.5 397B (Reasoning) · Alibaba · Open weight · 89.3%
35 · DeepSeek V4 Pro (Max) · DeepSeek · Open weight · 88.8%
36 · Qwen 3.6 Max (preview) · Alibaba · Closed · 88.8%
37 · Claude Opus 4.7 · Anthropic · Closed · 88.5%
38 · Muse Spark · Meta · Closed · 88.4%
39 · Qwen3.6 Plus · Alibaba · Closed · 88.2%
40 · Kimi K2.5 · Moonshot AI · Open weight · 87.9%
41 · Kimi K2.5 (Reasoning) · Moonshot AI · Closed · 87.9%
42 · Grok 4 · xAI · Closed · 87.7%
43 · GPT-5.4 mini · OpenAI · Closed · 87.5%
44 · MiniMax M2.7 · MiniMax · Open weight · 87.4%
45 · GPT-5.1 · OpenAI · Closed · 87.3%
46 · Inkling · Thinking Machines Lab · Open weight · 87.2%
47 · MiMo-V2-Pro · Xiaomi · Closed · 87.0%
48 · GLM-5.1 · Z.AI · Open weight · 86.8%
49 · Nemotron 3 Ultra · NVIDIA · Open weight · 86.7%
50 · DeepSeek V4 Flash (High) · DeepSeek · Open weight · 86.7%
51 · MiMo-V2.5-Pro · Xiaomi · Closed · 86.6%
52 · Claude Opus 4.5 Thinking · Anthropic · Closed · 86.6%
53 · GPT-5.1-Codex-Max · OpenAI · Closed · 86.0%
54 · GPT-5.1-Codex · OpenAI · Closed · 86.0%
55 · GLM-4.7 · Z.AI · Open weight · 85.9%
56 · Qwen3.5-27B · Alibaba · Open weight · 85.8%
57 · Qwen3.5-122B-A10B · Alibaba · Open weight · 85.7%
58 · Gemma 4 31B · Google · Open weight · 85.7%
59 · GPT-5 (high) · OpenAI · Closed · 85.4%
60 · Grok 4.1 Fast (Reasoning) · xAI · Closed · 85.3%
61 · MiniMax M2.5 · MiniMax · Closed · 84.8%
62 · GLM-5-Turbo · Z.AI · Closed · 84.7%
63 · Grok 4 Fast (Reasoning) · xAI · Closed · 84.7%
64 · Qwen3.5-35B-A3B · Alibaba · Open weight · 84.5%
65 · o3-pro · OpenAI · Closed · 84.5%
66 · Gemini 2.5 Pro · Google · Closed · 84.4%
67 · Qwen3.6-27B · Alibaba · Open weight · 84.2%
68 · GPT-5 (medium) · OpenAI · Closed · 84.2%
69 · Qwen3.6-35B-A3B · Alibaba · Open weight · 84.1%
70 · Claude Opus 4.6 · Anthropic · Closed · 84.0%
71 · Gemini 3.5 Flash-Lite · Google · Closed · 83.8%
72 · MiMo-V2-Omni · Xiaomi · Closed · 82.8%
73 · GPT-5 mini · OpenAI · Closed · 82.8%
74 · o3 · OpenAI · Closed · 82.7%
75 · Step 3.5 Flash · StepFun · Open weight · 82.6%
76 · Gemini 3.1 Flash-Lite · Google · Closed · 82.2%
77 · GLM-5 · Z.AI · Open weight · 82.0%
78 · GPT-5.4 nano · OpenAI · Closed · 81.7%
79 · DeepSeek-R1 · DeepSeek · Open weight · 81.3%
80 · Gemini 3 Flash · Google · Closed · 81.2%
81 · Claude Opus 4.5 · Anthropic · Closed · 81.0%
82 · Step 3.7 Flash · StepFun · Open weight · 80.9%
83 · GLM-5V-Turbo · Z.AI · Closed · 80.9%
84 · Claude 4.1 Opus Thinking · Anthropic · Closed · 80.9%
85 · Nemotron 3 Super 120B A12B · NVIDIA · Open weight · 80.0%
86 · Claude Sonnet 4.6 · Anthropic · Closed · 79.9%
87 · Gemma 4 26B A4B · Google · Open weight · 79.2%
88 · K-Exaone · LG AI Research · Closed · 78.3%
89 · GPT-OSS 120B · OpenAI · Open weight · 78.2%
90 · DeepSeek V3.1 (Reasoning) · DeepSeek · Open weight · 77.9%
91 · Mercury 2 · Inception · Closed · 77.0%
92 · Mistral Small 4 · Mistral · Open weight · 76.9%
93 · Mistral Small 4 (Reasoning) · Mistral · Open weight · 76.9%
94 · Kimi K2 · Moonshot AI · Closed · 76.6%
95 · Qwen3 Max · Alibaba · Closed · 76.4%
96 · Command A+ · Cohere · Open weight · 76.1%
97 · Nemotron 3 Nano 30B · NVIDIA · Open weight · 75.7%
98 · Gemma 4 12B · Google · Open weight · 75.3%
99 · Trinity-Large-Preview · Arcee AI · Open weight · 75.2%
100 · Trinity-Large-Thinking · Arcee AI · Open weight · 75.2%
101 · DeepSeek V3.2 · DeepSeek · Open weight · 75.1%
102 · Mistral Medium 3.5 128B · Mistral · Open weight · 74.8%
103 · o3-mini · OpenAI · Closed · 74.8%
104 · o1 · OpenAI · Closed · 74.7%
105 · Sarvam 105B · Sarvam · Open weight · 73.8%
106 · DeepSeek V3.1 · DeepSeek · Open weight · 73.5%
107 · GLM-4.5-Air · Z.AI · Closed · 73.3%
108 · Nemotron Ultra 253B · NVIDIA · Open weight · 72.8%
109 · Grok Code Fast 1 · xAI · Closed · 72.7%
110 · MiniMax M1 80k · MiniMax · Closed · 69.7%
111 · GPT-OSS 20B · OpenAI · Open weight · 68.8%
112 · Claude 4 Sonnet · Anthropic · Closed · 68.3%
113 · Gemini 2.5 Flash · Google · Closed · 68.3%
114 · Mistral Large 3 · Mistral · Closed · 68.0%
115 · GPT-5 nano · OpenAI · Closed · 67.6%
116 · Llama 4 Maverick · Meta · Open weight · 67.1%
117 · GPT-4.1 · OpenAI · Closed · 66.6%
118 · GPT-4.1 mini · OpenAI · Closed · 66.4%
119 · MiMo-V2-Flash · Xiaomi · Open weight · 65.6%
120 · Grok 4.1 Fast · xAI · Closed · 63.7%
121 · Sarvam 30B · Sarvam · Open weight · 63.3%
122 · GLM-4.6 · Z.AI · Open weight · 63.2%
123 · Exaone 4.0 32B · LG AI Research · Open weight · 62.8%
124 · DeepSeek R1 Distill Qwen 32B · DeepSeek · Open weight · 61.5%
125 · Ling 2.6 Flash · InclusionAI · Open weight · 59.3%
126 · Gemini 1.5 Pro · Google · Closed · 58.9%
127 · Llama 4 Scout · Meta · Open weight · 58.7%
128 · GLM-4.7-Flash · Z.AI · Open weight · 58.1%
129 · Mistral Medium 3 · Mistral · Closed · 57.8%
130 · Gemma 4 E4B · Google · Open weight · 57.6%
131 · Phi-4 · Microsoft · Open weight · 57.5%
132 · Ministral 3 14B (Reasoning) · Mistral · Open weight · 57.2%
133 · Ministral 3 14B · Mistral · Open weight · 57.2%
134 · Solar Pro 2 · Upstage · Closed · 56.1%
135 · DeepSeek V3 · DeepSeek · Open weight · 55.7%
136 · GPT-4o · OpenAI · Closed · 54.3%
137 · Llama 3.1 405B · Meta · Open weight · 51.5%
138 · LFM2.5-8B-A1B · LiquidAI · Open weight · 51.3%
139 · GPT-4.1 nano · OpenAI · Closed · 51.2%
140 · Nova Pro · Amazon · Closed · 49.9%
141 · Claude 3 Opus · Anthropic · Closed · 48.9%
142 · Mistral Large 2 · Mistral · Closed · 48.6%
143 · LFM2-24B-A2B · LiquidAI · Closed · 47.4%
144 · Ministral 3 8B (Reasoning) · Mistral · Open weight · 47.1%
145 · Ministral 3 8B · Mistral · Open weight · 47.1%
146 · Nemotron 3 Nano Omni 30B A3B · NVIDIA · Open weight · 46.9%
147 · Gemma 3 27B · Google · Open weight · 42.8%
148 · GPT-4o mini · OpenAI · Closed · 42.6%
149 · Exaone 4.0 1.2B · LG AI Research · Open weight · 42.4%
150 · Qwen2.5 Coder 32B Instruct · Alibaba · Open weight · 41.7%
151 · Gemma 4 E2B · Google · Open weight · 37.5%
152 · Claude 3 Haiku · Anthropic · Closed · 37.4%
153 · Ministral 3 3B (Reasoning) · Mistral · Open weight · 35.8%
154 · Ministral 3 3B · Mistral · Open weight · 35.8%
155 · LFM2.5-1.2B-Thinking · LiquidAI · Closed · 33.9%
156 · LFM2.5-1.2B-Instruct · LiquidAI · Closed · 32.6%
157 · LFM2.5-VL-1.6B-Extract · LiquidAI · Open weight · 28.9%
158 · Granite-4.0-1B · IBM · Open weight · 28.1%
159 · Gemini 1.0 Pro · Google · Closed · 27.7%
160 · Granite-4.0-H-1B · IBM · Open weight · 26.3%
161 · Granite-4.0-H-350M · IBM · Open weight · 25.7%
162 · Granite-4.0-350M · IBM · Open weight · 20.3%

The published AA-GPQA Diamond snapshot places GPT-5.6 Sol first at 94.1%, with Gemini 3.1 Pro listed second on the same score. The third row, Kimi K3 at 93.5%, is 0.6 points behind. The top 10 span 1.8 points (94.1% down to 92.3%), so the leading results sit in a narrow band.
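The gap figures in that summary can be reproduced from the score table. The snippet below is a minimal sketch: the `spread` helper is a hypothetical name introduced here for illustration, and the list simply copies the first ten published scores in table order.

```python
# First ten scores from the score table, in published order.
top10 = [94.1, 94.1, 93.5, 93.5, 93.1, 92.9, 92.8, 92.6, 92.5, 92.3]


def spread(scores, first, last):
    """Gap in points between two 1-indexed rows, rounded to one decimal."""
    return round(scores[first - 1] - scores[last - 1], 1)


gap_to_third = spread(top10, 1, 3)    # row 1 vs row 3
top10_range = spread(top10, 1, 10)    # row 1 vs row 10
print(gap_to_third, top10_range)      # prints: 0.6 1.8
```

Rounding to one decimal matches the precision of the published scores and avoids floating-point noise in the subtraction.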

162 models have been evaluated on AA-GPQA Diamond. The benchmark falls in the Knowledge category. This category carries a 12% weight in BenchLM.ai's overall scoring system. AA-GPQA Diamond is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
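To illustrate what "displayed but excluded from the scoring formula" means in practice, here is a minimal sketch. BenchLM's actual formula is not given on this page, so everything except the 12% Knowledge weight and the 94.1% score is an assumption: the `BenchmarkResult` class, the simple averaging rule, and the 90.0 placeholder score are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    score: float        # percent, 0-100
    display_only: bool  # True: shown on the page, not scored


KNOWLEDGE_WEIGHT = 0.12  # category weight stated in the text


def category_contribution(results, weight):
    """Average the scored benchmarks and apply the category weight.

    Display-only results are skipped, so they cannot move the total.
    The plain average is an assumption, not BenchLM's documented rule.
    """
    scored = [r.score for r in results if not r.display_only]
    if not scored:
        return 0.0
    return weight * sum(scored) / len(scored)


results = [
    BenchmarkResult("weighted GPQA lane (placeholder score)", 90.0, False),
    BenchmarkResult("AA-GPQA Diamond", 94.1, True),
]
before = category_contribution(results, KNOWLEDGE_WEIGHT)
results[1].score = 50.0  # a hypothetical AA refresh
after = category_contribution(results, KNOWLEDGE_WEIGHT)
print(before == after)   # True
```

Under these assumptions, a refresh of the display-only score leaves the category contribution unchanged, which is the behavior the paragraph above describes.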

About AA-GPQA Diamond

Year: 2026
Tasks: Graduate-level science questions
Format: Accuracy
Difficulty: Graduate-level science reasoning

BenchLM stores the Artificial Analysis GPQA Diamond result separately from the weighted GPQA lane so AA refreshes remain display-only.

Artificial Analysis GPQA Diamond Benchmark Leaderboard

BenchLM freshness & provenance

Version: AA-GPQA Diamond 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does AA-GPQA Diamond measure?

AA-GPQA Diamond reports accuracy on graduate-level science questions, as published by Artificial Analysis. BenchLM shows the score for display only.

Which model scores highest on AA-GPQA Diamond?

GPT-5.6 Sol by OpenAI currently leads with a score of 94.1% on AA-GPQA Diamond.

How many models are evaluated on AA-GPQA Diamond?

BenchLM lists AA-GPQA Diamond scores for 162 AI models.

Compare Top Models on AA-GPQA Diamond

GPT-5.6 Sol vs Gemini 3.1 Pro · Gemini 3.1 Pro vs Kimi K3 · Kimi K3 vs GPT-5.5 · GPT-5.5 vs Grok 4.5

Last updated: July 23, 2026 · BenchLM version AA-GPQA Diamond 2026
