Benchmark profile

BullshitBench v2

A benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. It measures a model's ability to push back on bad input.

How BenchLM shows BullshitBench v2

BenchLM mirrors the published BullshitBench v2 leaderboard using the official snapshot generated on July 16, 2026 at 9:22 PM UTC. The public view reports per-model clear-pushback rates across 100 nonsense prompts, scored by a 3-judge panel.
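As a rough illustration of how a clear-pushback rate could be derived from a 3-judge panel, the sketch below counts a prompt as a clear pushback when at least two of the three judges say so. The majority rule, the boolean verdict format, and the function name `clear_pushback_rate` are assumptions made for this example; the page does not state how the benchmark combines judge verdicts.

```python
def clear_pushback_rate(verdicts_per_prompt, min_votes=2):
    """Percentage of prompts that at least `min_votes` judges marked as clear pushback.

    ASSUMPTION: each prompt has one boolean verdict per judge and a simple
    vote threshold decides the outcome. The real aggregation rule may differ.
    """
    if not verdicts_per_prompt:
        raise ValueError("no prompts scored")
    hits = sum(1 for votes in verdicts_per_prompt if sum(votes) >= min_votes)
    return 100.0 * hits / len(verdicts_per_prompt)


# Toy data (made up): 4 prompts, 3 judge verdicts each.
toy_verdicts = [
    (True, True, True),     # unanimous pushback
    (True, True, False),    # majority pushback
    (True, False, False),   # minority only, not counted
    (False, False, False),  # no pushback
]
print(f"{clear_pushback_rate(toy_verdicts):.0f}%")  # 50%
```

With 100 prompts, each prompt moves the rate by exactly one percentage point, which is why the published scores are whole numbers.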

BullshitBench is a useful reasoning sanity check, but BenchLM currently keeps it display-only and gives it no weight in scoring. The public leaderboard is highly variant-specific and exposes reasoning-effort settings directly, so BenchLM treats it as a mirrored external benchmark instead of a canonical ranking input.
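To see what "variant-specific" means in practice, the sketch below groups a few rows from the table on this page by base model and reports how far the score moves between reasoning-effort settings. The scores are copied from the table; splitting each row into a base name and an effort label by hand, and the helper name `effort_spread`, are choices made for this example.

```python
from collections import defaultdict

# (base model, reasoning effort, clear pushback rate in %), copied from the table.
rows = [
    ("Claude Opus 4.8", "none", 95),
    ("Claude Opus 4.8", "xhigh", 94),
    ("Claude Opus 4.7", "none", 83),
    ("Claude Opus 4.7", "max", 74),
    ("Qwen3.7 Max", "none", 71),
    ("Qwen3.7 Max", "xhigh", 56),
]

# Group the variants of each base model: {base: {effort: score}}.
by_base = defaultdict(dict)
for base, effort, score in rows:
    by_base[base][effort] = score


def effort_spread(variants):
    """Gap in percentage points between the best and worst variant."""
    return max(variants.values()) - min(variants.values())


for base, variants in by_base.items():
    print(f"{base}: {variants} spread={effort_spread(variants)}")
```

For these rows the spread ranges from 1 point (Claude Opus 4.8) to 15 points (Qwen3.7 Max), so a single number per base model would hide real differences between settings.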

182 model variants · 105 base models · 100 nonsense prompts · 3 judges · Display only

BullshitBench viewer · Benchmark homepage · Published leaderboard CSV

Clear pushback rate on BullshitBench v2 — July 16, 2026 at 9:22 PM UTC

BenchLM mirrors the published clear pushback rate view for BullshitBench v2. Claude Opus 4.8 (none) leads the public snapshot at 95%, followed by Claude Opus 4.8 (xhigh) (94%) and Claude Sonnet 4.6 (high) (91%). BenchLM does not use these results to rank models overall.

1. Claude Opus 4.8 (none) · Anthropic · Closed · 95%
   anthropic/claude-opus-4.8@reasoning=none · Overall 78.34 · Context 1M

2. Claude Opus 4.8 (xhigh) · Anthropic · Closed · 94%
   anthropic/claude-opus-4.8@reasoning=xhigh · Overall 78.34 · Context 1M

3. Claude Sonnet 4.6 (high) · Anthropic · Closed · 91%
   anthropic/claude-sonnet-4.6@reasoning=high · Overall 65.07 · Context 200K
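The variant identifiers shown for the top three models follow a visible `provider/model@reasoning=effort` pattern. The sketch below splits such an identifier into its parts. It is inferred only from the identifiers on this page, not from a published specification, and the names `ModelVariant` and `parse_variant_id` are hypothetical.

```python
from typing import NamedTuple, Optional


class ModelVariant(NamedTuple):
    provider: str
    model: str
    reasoning: Optional[str]  # None when the identifier carries no setting


def parse_variant_id(variant_id: str) -> ModelVariant:
    """Split 'provider/model@reasoning=effort' into its parts.

    ASSUMPTION: at most one '@reasoning=...' suffix, as seen on this page.
    """
    base, _, setting = variant_id.partition("@")
    provider, _, model = base.partition("/")
    key, _, value = setting.partition("=")
    reasoning = value if key == "reasoning" and value else None
    return ModelVariant(provider, model, reasoning)


print(parse_variant_id("anthropic/claude-opus-4.8@reasoning=none"))
print(parse_variant_id("anthropic/claude-3.5-haiku"))
```

Note that the string `none` is a reasoning-effort label taken from the leaderboard, whereas Python's `None` here means the identifier had no setting at all.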

182 models · Reasoning · Current · Display only · Updated July 16, 2026 at 9:22 PM UTC

Clear pushback rate table (182 models)

Rank. Model (reasoning setting) · Provider · Access · Score

1. Claude Opus 4.8 (none) · Anthropic · Closed · 95%
2. Claude Opus 4.8 (xhigh) · Anthropic · Closed · 94%
3. Claude Sonnet 4.6 (high) · Anthropic · Closed · 91%
4. Claude Opus 4.5 (high) · Anthropic · Closed · 90%
5. Claude Sonnet 4.6 (none) · Anthropic · Closed · 89%
6. Claude Opus 4.6 (high) · Anthropic · Closed · 87%
7. Claude Opus 4.6 (none) · Anthropic · Closed · 83%
8. Claude Opus 4.7 (none) · Anthropic · Closed · 83%
9. Claude Sonnet 5 (low) · Anthropic · Closed · 80%
10. Claude Sonnet 4.5 (high) · Anthropic · Closed · 79%
11. Claude Opus 4.5 (none) · Anthropic · Closed · 79%
12. Claude Sonnet 5 (max) · Anthropic · Closed · 78%
13. Qwen3.5 397B (Reasoning) (high) · Alibaba · Open weight · 78%
14. Claude Haiku 4.5 (high) · Anthropic · Closed · 77%
15. Claude Opus 4.7 (max) · Anthropic · Closed · 74%
16. Claude Sonnet 4.5 (none) · Anthropic · Closed · 74%
17. Kimi K3 (xhigh) · Moonshot AI · Closed · 73%
18. Qwen3.6 Plus (none) · Alibaba · Closed · 72%
19. Kimi K3 (minimal) · Moonshot AI · Closed · 71%
20. Claude Haiku 4.5 (none) · Anthropic · Closed · 71%
21. Qwen3.7 Max (none) · Alibaba · Closed · 71%
22. Qwen3.5 397B (none) · Alibaba · Open weight · 69%
23. Grok 4.20 Multi-Agent Beta (low) · 67%
24. Kimi K2.6 (none) · Moonshot AI · Open weight · 65%
25. Grok 4.20 Multi-Agent Beta (xhigh) · 64%
26. Qwen3.6 Plus (high) · Alibaba · Closed · 63%
27. MiniMax M3 (xhigh) · MiniMax · Open weight · 63%
28. MiniMax M3 (none) · MiniMax · Open weight · 62%
29. MiMo-V2.5-Pro (xhigh) · Xiaomi · Closed · 62%
30. Qwen3.6 Plus (xhigh) · Alibaba · Closed · 59%
31. Qwen3.7 Max (xhigh) · Alibaba · Closed · 56%
32. Grok 4.20 Beta (low) · 56%
33. Claude Fable 5 (xhigh) · Anthropic · Closed · 54%
34. Grok 4.5 (low) · xAI · Closed · 54%
35. Grok 4.5 (high) · xAI · Closed · 54%
36. Grok 4.20 Beta (xhigh) · 54%
37. Nemotron 3 Super 120B A12B (xhigh) · NVIDIA · Open weight · 54%
38. GPT-5.6 Terra (max) · OpenAI · Closed · 53%
39. Kimi K2.5 (none) · Moonshot AI · Open weight · 52%
40. Grok 4.3 (minimal) · xAI · Closed · 50%
41. Kimi K2.6 (xhigh) · Moonshot AI · Open weight · 50%
42. anthropic/claude-3.5-haiku · Anthropic · 50%
43. anthropic/claude-3.7-sonnet:thinking · Anthropic · 49%
44. nvidia/nemotron-3-ultra-550b-a55b (none) · NVIDIA · 49%
45. GPT-5.4 (none) · OpenAI · Closed · 48%
46. Gemini 3 Pro (low) · Google · Closed · 48%
47. GPT-5.6 Sol (low) · OpenAI · Closed · 47%
48. GPT-5.5 (xhigh) · OpenAI · Closed · 47%
49. Nemotron 3 Super 120B A12B (high) · NVIDIA · Open weight · 47%
50. GPT-5.6 Sol (max) · OpenAI · Closed · 46%
51. GPT-5.6 Terra (low) · OpenAI · Closed · 46%
52. Qwen3.6 Plus (none) · Alibaba · Closed · 46%
53. Grok 4.3 (xhigh) · xAI · Closed · 46%
54. GPT-5.5 (none) · OpenAI · Closed · 45%
55. GPT-5.5 (low) · OpenAI · Closed · 45%
56. GPT-5.2-Codex (low) · OpenAI · Closed · 45%
57. Claude 3.5 Sonnet · Anthropic · Closed · 45%
58. GPT-5.1 · OpenAI · Closed · 45%
59. Claude Fable 5 (low) · Anthropic · Closed · 44%
60. Claude 4.1 Opus (none) · Anthropic · Closed · 43%
61. anthropic/claude-3.7-sonnet · Anthropic · 43%
62. Nemotron 3 Super 120B A12B (none) · NVIDIA · Open weight · 43%
63. openrouter/hunter-alpha (none) · Stealth · 43%
64. GPT-5.4 (xhigh) · OpenAI · Closed · 42%
65. Claude 4.1 Opus (high) · Anthropic · Closed · 42%
66. GPT-5.6 Luna (max) · OpenAI · Closed · 40%
67. GPT-5.3 Instant · OpenAI · Closed · 40%
68. nvidia/nemotron-3-ultra-550b-a55b (xhigh) · NVIDIA · 40%
69. GPT-5 Codex · 39%
70. GPT-5.2-Codex (xhigh) · OpenAI · Closed · 39%
71. GPT-5.2 (none) · OpenAI · Closed · 38%
72. MiMo-V2.5-Pro (none) · Xiaomi · Closed · 38%
73. Gemini 3.1 Pro (low) · Google · Closed · 37%
74. GPT-5.2-Codex (high) · OpenAI · Closed · 37%
75. openrouter/healer-alpha (none) · Stealth · 37%
76. GPT-5.5 Pro (xhigh) · OpenAI · Closed · 36%
77. GPT-5.6 Luna (low) · OpenAI · Closed · 36%
78. Gemini 3 Pro Deep Think (high) · Google · Closed · 36%
79. MiMo-V2.5 (xhigh) · Xiaomi · Closed · 35%
80. openrouter/hunter-alpha (xhigh) · Stealth · 35%
81. GPT-5.5 Pro (medium) · OpenAI · Closed · 34%
82. GPT-5.5 · OpenAI · Closed · 34%
83. Claude Opus 4 · 34%
84. GPT-5.4 mini (high) · OpenAI · Closed · 32%
85. GPT-5.4 mini (none) · OpenAI · Closed · 32%
86. GPT-5.1-Codex-Max · OpenAI · Closed · 32%
87. GPT-5.4 mini (xhigh) · OpenAI · Closed · 31%
88. Kimi K2.5 (Reasoning) (high) · Moonshot AI · Closed · 31%
89. Gemini 3.1 Pro (high) · Google · Closed · 31%
90. GLM-5-Turbo (high) · Z.AI · Closed · 31%
91. Nemotron 3 Super 120B A12B (none) · NVIDIA · Open weight · 31%
92. GLM-5.2 (xhigh) · Z.AI · Open weight · 31%
93. Claude 4 Sonnet (high) · Anthropic · Closed · 30%
94. Claude 4 Sonnet (none) · Anthropic · Closed · 29%
95. GPT-5.2 (high) · OpenAI · Closed · 28%
96. Llama 4 Maverick · Meta · Open weight · 28%
97. GLM-5 (Reasoning) (high) · Z.AI · Open weight · 28%
98. Nemotron 3 Nano 30B A3B (none) · 28%
99. GPT-5.2 Instant · OpenAI · Closed · 27%
100. o3 · OpenAI · Closed · 26%
101. openrouter/healer-alpha (xhigh) · Stealth · 26%
102. GPT-5.1 · OpenAI · Closed · 25%
103. Gemma 4 31B (high) · Google · Open weight · 25%
104. GPT-5.3 Codex (low) · OpenAI · Closed · 24%
105. MiMo-V2.5 (none) · Xiaomi · Closed · 24%
106. GLM-5-Turbo (none) · Z.AI · Closed · 23%
107. GLM-5.1 (xhigh) · Z.AI · Open weight · 22%
108. Step 3.5 Flash (xhigh) · StepFun · Open weight · 22%
109. GPT-5 · 21%
110. Gemma 4 26B A4B (xhigh) · Google · Open weight · 21%
111. GPT-5.3 Codex (high) · OpenAI · Closed · 20%
112. Qwen3 Coder 480B A35B · 20%
113. Gemini 2.5 Pro · Google · Closed · 20%
114. GLM-5 (none) · Z.AI · Open weight · 20%
115. Gemma 4 31B (none) · Google · Open weight · 20%
116. Gemini 3.5 Flash (xhigh) · Google · Closed · 20%
117. GPT-5.3 Codex (xhigh) · OpenAI · Closed · 19%
118. Grok 4.1 Fast (high) · xAI · Closed · 19%
119. Llama 4 Scout · Meta · Open weight · 19%
120. Gemini 2.5 Flash · Google · Closed · 19%
121. Gemini 3.5 Flash (minimal) · Google · Closed · 19%
122. GPT-5 · 18%
123. DeepSeek V4 Flash (none) · DeepSeek · Open weight · 18%
124. GLM-5.1 (none) · Z.AI · Open weight · 18%
125. GLM-5.2 (none) · Z.AI · Open weight · 17%
126. Trinity-Large-Thinking (minimal) · Arcee AI · Open weight · 17%
127. MiMo-V2-Flash (none) · Xiaomi · Open weight · 16%
128. Hy3 (none) · Tencent · Open weight · 16%
129. google/gemini-2.0-flash-001 · Google · 15%
130. DeepSeek V4 Pro (xhigh) · DeepSeek · Open weight · 14%
131. meta-llama/llama-3.1-8b-instruct · Meta · 14%
132. GPT-5.4 nano (high) · OpenAI · Closed · 14%
133. DeepSeek V4 Pro (none) · DeepSeek · Open weight · 14%
134. GPT-4.1 · OpenAI · Closed · 14%
135. DeepSeek V4 Flash (xhigh) · DeepSeek · Open weight · 14%
136. GPT-5.4 nano (none) · OpenAI · Closed · 13%
137. DeepSeek V3.2 (Thinking) (high) · DeepSeek · Open weight · 13%
138. Step 3.5 Flash (minimal) · StepFun · Open weight · 13%
139. Trinity-Large-Thinking (xhigh) · Arcee AI · Open weight · 13%
140. MiMo-V2-Flash (high) · Xiaomi · Open weight · 13%
141. openai/gpt-4o-2024-08-06 · OpenAI · 12%
142. Gemma 4 26B A4B (none) · Google · Open weight · 11%
143. Gemini 3.1 Flash-Lite · Google · Closed · 11%
144. Seed 1.6 (none) · ByteDance · Closed · 11%
145. GPT-OSS 120B (low) · OpenAI · Open weight · 11%
146. baidu/ernie-4.5-vl-424b-a47b (xhigh) · Baidu · 11%
147. GPT-5.4 nano (xhigh) · OpenAI · Closed · 10%
148. Gemini 3 Flash (high) · Google · Closed · 10%
149. DeepSeek V3.2 (none) · DeepSeek · Open weight · 10%
150. Claude 3 Haiku · Anthropic · Closed · 10%
151. Gemini 3 Flash (none) · Google · Closed · 10%
152. nvidia/nemotron-3-nano-30b-a3b:free (xhigh) · NVIDIA · 10%
153. Kimi K2 · Moonshot AI · Closed · 10%
154. Grok 4.1 Fast (none) · xAI · Closed · 10%
155. MiniMax M2.5 (low) · MiniMax · Closed
156. Hy3 (xhigh) · Tencent · Open weight
157. MiniMax M2.5 (high) · MiniMax · Closed
158. GLM-4.5 (xhigh) · Z.AI · Closed
159. MiniMax M2.7 (high) · MiniMax · Open weight
160. DeepSeek-R1 (xhigh) · DeepSeek · Open weight
161. o4-mini (high) (low) · OpenAI · Closed
162. Seed 1.6 (high) · ByteDance · Closed
163. MiniMax M2.7 (low) · MiniMax · Open weight
164. DeepSeek-R1 (none) · DeepSeek · Open weight
165. prime-intellect/intellect-3 (low) · Prime Intellect
166. mistralai/mistral-small-2603 (high) · Mistral
167. qwen/qwen3-235b-a22b (none) · Alibaba
168. GLM-4.5 (none) · Z.AI · Closed
169. GPT-OSS 120B (high) · OpenAI · Open weight
170. nvidia/nemotron-nano-9b-v2:free (none) · NVIDIA
171. prime-intellect/intellect-3 (high) · Prime Intellect
172. ai21/jamba-large-1.7 · AI21 Labs
173. o4-mini (high) (high) · OpenAI · Closed
174. baidu/ernie-4.5-300b-a47b · Baidu
175. deepseek/deepseek-chat · DeepSeek
176. mistralai/mistral-small-2603 (none) · Mistral
177. baidu/ernie-4.5-vl-424b-a47b (none) · Baidu
178. qwen/qwen3-235b-a22b (xhigh) · Alibaba
179. nvidia/nemotron-nano-9b-v2:free (xhigh) · NVIDIA
180. google/gemma-3-27b-it · Google
181. mistralai/mistral-large-2512 · Mistral
182. openai/gpt-4o-mini-2024-07-18 · OpenAI

The published BullshitBench v2 snapshot places Claude Opus 4.8 (none) first at 95%. Third place, Claude Sonnet 4.6 (high) at 91%, is 4.0 points behind. The top 10 span 16.0 points, from 95% down to 79%, so the table still separates the leading systems.
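The gap figures quoted above can be reproduced from the first ten scores in the table. The short sketch below does the subtraction; the helper name `lead_over_rank` is made up for this example.

```python
# Clear pushback rates (%) of the first ten rows in the table, in published order.
top10 = [95, 94, 91, 90, 89, 87, 83, 83, 80, 79]


def lead_over_rank(scores, rank):
    """Points by which the best score leads the score at the given 1-based rank."""
    ordered = sorted(scores, reverse=True)
    return ordered[0] - ordered[rank - 1]


print(f"Lead over third place: {lead_over_rank(top10, 3):.1f} points")  # 4.0
print(f"Top-10 range: {lead_over_rank(top10, 10):.1f} points")          # 16.0
```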

182 model variants (105 base models) have been evaluated on BullshitBench v2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. BullshitBench v2 itself is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About BullshitBench v2

Year: 2025

Tasks: Nonsensical and flawed prompts across multiple domains

Format: Prompt challenge and refusal evaluation

Difficulty: Robustness and critical reasoning

BullshitBench evaluates a crucial real-world capability: knowing when NOT to answer. Models that score highly recognize flawed premises, impossible physics scenarios, and logical contradictions rather than hallucinating plausible-sounding responses. V2 includes harder and more diverse challenge categories.

BullshitBench: Measuring whether AI models challenge nonsensical prompts · Public benchmark source

BenchLM freshness & provenance

Version: BullshitBench v2, 2025

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

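What does BullshitBench v2 measure?

BullshitBench v2 measures whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. The published view reports each model's clear pushback rate across 100 nonsense prompts, scored by a 3-judge panel.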

Which model leads the published BullshitBench v2 snapshot?

Claude Opus 4.8 (none) currently leads the published BullshitBench v2 snapshot with 95% clear pushback rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on BullshitBench v2?

182 AI models are included in BenchLM's mirrored BullshitBench v2 snapshot, based on the public leaderboard captured on July 16, 2026 at 9:22 PM UTC.

Last updated: July 16, 2026 at 9:22 PM UTC · mirrored from the public benchmark leaderboard
