BullshitBench v2

BullshitBench v2 is a benchmark that tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. It measures how reliably a model pushes back on bad input.

How BenchLM shows BullshitBench v2

BenchLM mirrors the published BullshitBench v2 leaderboard using the official snapshot generated on June 1, 2026 at 5:53 AM UTC. The public view reports per-model clear-pushback rates across 100 nonsense prompts, scored by a 3-judge panel.
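The source does not say how the three judges' verdicts are combined into a single clear-pushback decision. As a rough illustration only, the sketch below assumes a simple majority vote per prompt; the function name `clear_pushback_rate` and the boolean verdict layout are hypothetical and not taken from the benchmark's code.

```python
def clear_pushback_rate(verdicts):
    """Percentage of prompts judged as clear pushback.

    verdicts: one list of booleans per prompt, one boolean per judge
    (True means that judge saw a clear pushback).
    ASSUMPTION: a prompt counts as clear pushback when a strict majority
    of judges say so. The real benchmark may aggregate differently.
    """
    if not verdicts:
        raise ValueError("need at least one prompt")
    clear = sum(1 for judges in verdicts if sum(judges) * 2 > len(judges))
    return 100.0 * clear / len(verdicts)


# Four made-up prompts, three judges each (illustrative data, not real results).
example = [
    [True, True, True],     # unanimous pushback
    [True, True, False],    # majority pushback
    [True, False, False],   # minority only, not counted
    [False, False, False],  # no pushback
]
rate = clear_pushback_rate(example)
print(rate)
```

With 100 prompts, as in the published snapshot, each prompt would move the rate by one percentage point under this assumption.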

BullshitBench is a useful reasoning sanity check, but BenchLM currently keeps it display-only rather than weighting it in overall scores. The public leaderboard is highly variant-specific and exposes reasoning-effort settings directly (the same base model appears once per setting), so BenchLM treats it as a mirrored external benchmark instead of a canonical ranking input.

164 model variants · 96 base models · 100 nonsense prompts · 3 judges · Display only

Clear pushback rate on BullshitBench v2 — June 1, 2026 at 5:53 AM UTC

BenchLM mirrors the published clear pushback rate view for BullshitBench v2. Claude Opus 4.8 (none) leads the public snapshot at 95%, followed by Claude Opus 4.8 (xhigh) (94%) and Claude Sonnet 4.6 (high) (91%). BenchLM does not use these results to rank models overall.

164 models · Reasoning · Current · Display only · Updated June 1, 2026 at 5:53 AM UTC

The published BullshitBench v2 snapshot is tightly clustered at the top: Claude Opus 4.8 (none) sits at 95%, while the third row is only 4.0 points behind. The broader top-10 spread is 16.0 points, so the benchmark still separates strong models even when the leaders cluster.
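The two spread figures can be reproduced from the top ten rates in the table. This is a minimal sketch; the helper name `spread` is made up for illustration, and the scores are copied from the first ten rows of the snapshot.

```python
# Clear pushback rates (percent) for the top 10 rows of the snapshot table.
top10 = [95, 94, 91, 90, 89, 87, 83, 83, 79, 79]


def spread(scores, k):
    """Gap in percentage points between the best score and the k-th ranked score."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[k - 1]


top3_gap = spread(top10, 3)       # leader vs. third row
top10_spread = spread(top10, 10)  # leader vs. tenth row
print(top3_gap, top10_spread)
```

The first value is the gap between the leader and the third row; the second is the gap across the whole top ten.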

164 model variants have been evaluated on BullshitBench v2. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. BullshitBench v2 itself is displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About BullshitBench v2

Year: 2025

Tasks: Nonsensical and flawed prompts across multiple domains

Format: Prompt challenge and refusal evaluation

Difficulty: Robustness and critical reasoning

BullshitBench evaluates a crucial real-world capability: knowing when NOT to answer. Models that score highly recognize flawed premises, impossible physics scenarios, and logical contradictions rather than hallucinating plausible-sounding responses. V2 includes harder and more diverse challenge categories.

BenchLM freshness & provenance

Version: BullshitBench v2 2025

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Clear pushback rate table (164 models)

Rank · Model · Variant ID · Clear pushback rate
1 · Claude Opus 4.8 (none) · anthropic/claude-opus-4.8@reasoning=none · 95%
2 · Claude Opus 4.8 (xhigh) · anthropic/claude-opus-4.8@reasoning=xhigh · 94%
3 · Claude Sonnet 4.6 (high) · anthropic/claude-sonnet-4.6@reasoning=high · 91%
4 · Claude Opus 4.5 (high) · anthropic/claude-opus-4.5@reasoning=high · 90%
5 · Claude Sonnet 4.6 (none) · anthropic/claude-sonnet-4.6@reasoning=none · 89%
6 · Claude Opus 4.6 (high) · anthropic/claude-opus-4.6@reasoning=high · 87%
7 · Claude Opus 4.6 (none) · anthropic/claude-opus-4.6@reasoning=none · 83%
8 · Claude Opus 4.7 (none) · anthropic/claude-opus-4.7@reasoning=none · 83%
9 · Claude Sonnet 4.5 (high) · anthropic/claude-sonnet-4.5@reasoning=high · 79%
10 · Claude Opus 4.5 (none) · anthropic/claude-opus-4.5@reasoning=none · 79%
11 · Qwen3.5 397B (Reasoning) (high) · qwen/qwen3.5-397b-a17b@reasoning=high · 78%
12 · Claude Haiku 4.5 (high) · anthropic/claude-haiku-4.5@reasoning=high · 77%
13 · Claude Opus 4.7 (max) · anthropic/claude-opus-4.7@reasoning=max · 74%
14 · Claude Sonnet 4.5 (none) · anthropic/claude-sonnet-4.5@reasoning=none · 74%
15 · Qwen3.6 Plus (none) · qwen/qwen3.6-plus@reasoning=none · 72%
16 · Claude Haiku 4.5 (none) · anthropic/claude-haiku-4.5@reasoning=none · 71%
17 · Qwen3.7 Max (none) · qwen/qwen3.7-max@reasoning=none · 71%
18 · Qwen3.5 397B (none) · qwen/qwen3.5-397b-a17b@reasoning=none · 69%
19 · Grok 4.20 Multi-Agent Beta (low) · x-ai/grok-4.20-multi-agent-beta@reasoning=low · 67%
20 · Kimi K2.6 (none) · moonshotai/kimi-k2.6@reasoning=none · 65%
21 · Grok 4.20 Multi-Agent Beta (xhigh) · x-ai/grok-4.20-multi-agent-beta@reasoning=xhigh · 64%
22 · Qwen3.6 Plus (high) · qwen/qwen3-max-thinking@reasoning=high · 63%
23 · MiniMax M3 (xhigh) · minimax/minimax-m3@reasoning=xhigh · 63%
24 · MiniMax M3 (none) · minimax/minimax-m3@reasoning=none · 62%
25 · MiMo-V2.5-Pro (xhigh) · xiaomi/mimo-v2.5-pro@reasoning=xhigh · 62%
26 · Qwen3.6 Plus (xhigh) · qwen/qwen3.6-plus@reasoning=xhigh · 59%
27 · Qwen3.7 Max (xhigh) · qwen/qwen3.7-max@reasoning=xhigh · 56%
28 · Grok 4.20 Beta (low) · x-ai/grok-4.20-beta@reasoning=low · 56%
29 · Grok 4.20 Beta (xhigh) · x-ai/grok-4.20-beta@reasoning=xhigh · 54%
30 · Nemotron 3 Super 120B A12B (xhigh) · nvidia/nemotron-3-super-120b-a12b:free@reasoning=xhigh · 54%
31 · Kimi K2.5 (none) · moonshotai/kimi-k2.5@reasoning=none · 52%
32 · Grok 4.3 (minimal) · x-ai/grok-4.3@reasoning=minimal · 50%
33 · Kimi K2.6 (xhigh) · moonshotai/kimi-k2.6@reasoning=xhigh · 50%
34 · anthropic/claude-3.5-haiku · anthropic/claude-3.5-haiku@reasoning=default · 50%
35 · anthropic/claude-3.7-sonnet:thinking · anthropic/claude-3.7-sonnet:thinking@reasoning=default · 49%
36 · GPT-5.4 (none) · openai/gpt-5.4@reasoning=none · 48%
37 · Gemini 3 Pro (low) · google/gemini-3-pro-preview@reasoning=low · 48%
38 · GPT-5.5 (xhigh) · openai/gpt-5.5@reasoning=xhigh · 47%
39 · Nemotron 3 Super 120B A12B (high) · nvidia/nemotron-3-super-120b-a12b@reasoning=high · 47%
40 · Qwen3.6 Plus (none) · qwen/qwen3-max-thinking@reasoning=none · 46%
41 · Grok 4.3 (xhigh) · x-ai/grok-4.3@reasoning=xhigh · 46%
42 · GPT-5.5 (none) · openai/gpt-5.5@reasoning=none · 45%
43 · GPT-5.5 (low) · openai/gpt-5.5@reasoning=low · 45%
44 · GPT-5.2-Codex (low) · openai/gpt-5.2-codex@reasoning=low · 45%
45 · Claude 3.5 Sonnet · anthropic/claude-3.5-sonnet@reasoning=default · 45%
46 · GPT-5.1 · openai/gpt-5.1-chat@reasoning=default · 45%
47 · Claude 4.1 Opus (none) · anthropic/claude-opus-4.1@reasoning=none · 43%
48 · anthropic/claude-3.7-sonnet · anthropic/claude-3.7-sonnet@reasoning=default · 43%
49 · Nemotron 3 Super 120B A12B (none) · nvidia/nemotron-3-super-120b-a12b:free@reasoning=none · 43%
50 · openrouter/hunter-alpha (none) · openrouter/hunter-alpha@reasoning=none · 43%
51 · GPT-5.4 (xhigh) · openai/gpt-5.4@reasoning=xhigh · 42%
52 · Claude 4.1 Opus (high) · anthropic/claude-opus-4.1@reasoning=high · 42%
53 · GPT-5.3 Instant · openai/gpt-5.3-chat@reasoning=default · 40%
54 · GPT-5 Codex · openai/gpt-5-codex@reasoning=default · 39%
55 · GPT-5.2-Codex (xhigh) · openai/gpt-5.2-codex@reasoning=xhigh · 39%
56 · GPT-5.2 (none) · openai/gpt-5.2@reasoning=none · 38%
57 · MiMo-V2.5-Pro (none) · xiaomi/mimo-v2.5-pro@reasoning=none · 38%
58 · Gemini 3.1 Pro (low) · google/gemini-3.1-pro-preview@reasoning=low · 37%
59 · GPT-5.2-Codex (high) · openai/gpt-5.2-codex@reasoning=high · 37%
60 · openrouter/healer-alpha (none) · openrouter/healer-alpha@reasoning=none · 37%
61 · GPT-5.5 Pro (xhigh) · openai/gpt-5.5-pro@reasoning=xhigh · 36%
62 · Gemini 3 Pro Deep Think (high) · google/gemini-3-pro-preview@reasoning=high · 36%
63 · MiMo-V2.5 (xhigh) · xiaomi/mimo-v2.5@reasoning=xhigh · 35%
64 · openrouter/hunter-alpha (xhigh) · openrouter/hunter-alpha@reasoning=xhigh · 35%
65 · GPT-5.5 Pro (medium) · openai/gpt-5.5-pro@reasoning=medium · 34%
66 · GPT-5.5 · openai/gpt-5.5-chat@reasoning=default · 34%
67 · Claude Opus 4 · anthropic/claude-opus-4@reasoning=default · 34%
68 · GPT-5.4 mini (high) · openai/gpt-5.4-mini@reasoning=high · 32%
69 · GPT-5.4 mini (none) · openai/gpt-5.4-mini@reasoning=none · 32%
70 · GPT-5.1-Codex-Max · openai/gpt-5.1-codex@reasoning=default · 32%
71 · GPT-5.4 mini (xhigh) · openai/gpt-5.4-mini@reasoning=xhigh · 31%
72 · Kimi K2.5 (Reasoning) (high) · moonshotai/kimi-k2.5@reasoning=high · 31%
73 · Gemini 3.1 Pro (high) · google/gemini-3.1-pro-preview@reasoning=high · 31%
74 · GLM-5-Turbo (high) · z-ai/glm-5-turbo@reasoning=high · 31%
75 · Nemotron 3 Super 120B A12B (none) · nvidia/nemotron-3-super-120b-a12b@reasoning=none · 31%
76 · Claude 4 Sonnet (high) · anthropic/claude-sonnet-4@reasoning=high · 30%
77 · Claude 4 Sonnet (none) · anthropic/claude-sonnet-4@reasoning=none · 29%
78 · GPT-5.2 (high) · openai/gpt-5.2@reasoning=high · 28%
79 · Llama 4 Maverick · meta-llama/llama-4-maverick@reasoning=default · 28%
80 · GLM-5 (Reasoning) (high) · z-ai/glm-5@reasoning=high · 28%
81 · Nemotron 3 Nano 30B A3B (none) · nvidia/nemotron-3-nano-30b-a3b:free@reasoning=none · 28%
82 · GPT-5.2 Instant · openai/gpt-5.2-chat@reasoning=default · 27%
83 · o3 · openai/o3@reasoning=default · 26%
84 · openrouter/healer-alpha (xhigh) · openrouter/healer-alpha@reasoning=xhigh · 26%
85 · GPT-5.1 · openai/gpt-5.1@reasoning=default · 25%
86 · Gemma 4 31B (high) · google/gemma-4-31b-it@reasoning=high · 25%
87 · GPT-5.3 Codex (low) · openai/gpt-5.3-codex@reasoning=low · 24%
88 · MiMo-V2.5 (none) · xiaomi/mimo-v2.5@reasoning=none · 24%
89 · GLM-5-Turbo (none) · z-ai/glm-5-turbo@reasoning=none · 23%
90 · GLM-5.1 (xhigh) · z-ai/glm-5.1@reasoning=xhigh · 22%
91 · Step 3.5 Flash (xhigh) · stepfun/step-3.5-flash@reasoning=xhigh · 22%
92 · GPT-5 · openai/gpt-5@reasoning=default · 21%
93 · Gemma 4 26B A4B (xhigh) · google/gemma-4-26b-a4b-it@reasoning=xhigh · 21%
94 · GPT-5.3 Codex (high) · openai/gpt-5.3-codex@reasoning=high · 20%
95 · Qwen3 Coder 480B A35B · qwen/qwen3-coder@reasoning=default · 20%
96 · Gemini 2.5 Pro · google/gemini-2.5-pro@reasoning=default · 20%
97 · GLM-5 (none) · z-ai/glm-5@reasoning=none · 20%
98 · Gemma 4 31B (none) · google/gemma-4-31b-it@reasoning=none · 20%
99 · Gemini 3.5 Flash (xhigh) · google/gemini-3.5-flash@reasoning=xhigh · 20%
100 · GPT-5.3 Codex (xhigh) · openai/gpt-5.3-codex@reasoning=xhigh · 19%
101 · Grok 4.1 Fast (high) · x-ai/grok-4.1-fast@reasoning=high · 19%
102 · Llama 4 Scout · meta-llama/llama-4-scout@reasoning=default · 19%
103 · Gemini 2.5 Flash · google/gemini-2.5-flash@reasoning=default · 19%
104 · Gemini 3.5 Flash (minimal) · google/gemini-3.5-flash@reasoning=minimal · 19%
105 · GPT-5 · openai/gpt-5-chat@reasoning=default · 18%
106 · DeepSeek V4 Flash (none) · deepseek/deepseek-v4-flash@reasoning=none · 18%
107 · GLM-5.1 (none) · z-ai/glm-5.1@reasoning=none · 18%
108 · Trinity-Large-Thinking (minimal) · arcee-ai/trinity-large-thinking@reasoning=minimal · 17%
109 · MiMo-V2-Flash (none) · xiaomi/mimo-v2-flash@reasoning=none · 16%
110 · Hy3 Preview (none) · tencent/hy3-preview:free@reasoning=none · 16%
111 · google/gemini-2.0-flash-001 · google/gemini-2.0-flash-001@reasoning=default · 15%
112 · DeepSeek V4 Pro (xhigh) · deepseek/deepseek-v4-pro@reasoning=xhigh · 14%
113 · meta-llama/llama-3.1-8b-instruct · meta-llama/llama-3.1-8b-instruct@reasoning=default · 14%
114 · GPT-5.4 nano (high) · openai/gpt-5.4-nano@reasoning=high · 14%
115 · DeepSeek V4 Pro (none) · deepseek/deepseek-v4-pro@reasoning=none · 14%
116 · GPT-4.1 · openai/gpt-4.1@reasoning=default · 14%
117 · DeepSeek V4 Flash (xhigh) · deepseek/deepseek-v4-flash@reasoning=xhigh · 14%
118 · GPT-5.4 nano (none) · openai/gpt-5.4-nano@reasoning=none · 13%
119 · DeepSeek V3.2 (Thinking) (high) · deepseek/deepseek-v3.2@reasoning=high · 13%
120 · Step 3.5 Flash (minimal) · stepfun/step-3.5-flash@reasoning=minimal · 13%
121 · Trinity-Large-Thinking (xhigh) · arcee-ai/trinity-large-thinking@reasoning=xhigh · 13%
122 · MiMo-V2-Flash (high) · xiaomi/mimo-v2-flash@reasoning=high · 13%
123 · openai/gpt-4o-2024-08-06 · openai/gpt-4o-2024-08-06@reasoning=default · 12%
124 · Gemma 4 26B A4B (none) · google/gemma-4-26b-a4b-it@reasoning=none · 11%
125 · Gemini 3.1 Flash-Lite · google/gemini-3.1-flash-lite-preview@reasoning=default · 11%
126 · Seed 1.6 (none) · bytedance-seed/seed-1.6@reasoning=none · 11%
127 · GPT-OSS 120B (low) · openai/gpt-oss-120b@reasoning=low · 11%
128 · baidu/ernie-4.5-vl-424b-a47b (xhigh) · baidu/ernie-4.5-vl-424b-a47b@reasoning=xhigh · 11%
129 · GPT-5.4 nano (xhigh) · openai/gpt-5.4-nano@reasoning=xhigh · 10%
130 · Gemini 3 Flash (high) · google/gemini-3-flash-preview@reasoning=high · 10%
131 · DeepSeek V3.2 (none) · deepseek/deepseek-v3.2@reasoning=none · 10%
132 · Claude 3 Haiku · anthropic/claude-3-haiku@reasoning=default · 10%
133 · Gemini 3 Flash (none) · google/gemini-3-flash-preview@reasoning=none · 10%
134 · nvidia/nemotron-3-nano-30b-a3b:free (xhigh) · nvidia/nemotron-3-nano-30b-a3b:free@reasoning=xhigh · 10%
135 · Kimi K2 · moonshotai/kimi-k2@reasoning=default · 10%
136 · Grok 4.1 Fast (none) · x-ai/grok-4.1-fast@reasoning=none · 10%
137 · MiniMax M2.5 (low) · minimax/minimax-m2.5@reasoning=low · 9%
138 · Hy3 Preview (xhigh) · tencent/hy3-preview:free@reasoning=xhigh · 8%
139 · MiniMax M2.5 (high) · minimax/minimax-m2.5@reasoning=high · 8%
140 · GLM-4.5 (xhigh) · z-ai/glm-4.5@reasoning=xhigh · 8%
141 · MiniMax M2.7 (high) · minimax/minimax-m2.7@reasoning=high · 8%
142 · DeepSeek-R1 (xhigh) · deepseek/deepseek-r1@reasoning=xhigh · 8%
143 · o4-mini (high) (low) · openai/o4-mini@reasoning=low · 8%
144 · Seed 1.6 (high) · bytedance-seed/seed-1.6@reasoning=high · 7%
145 · MiniMax M2.7 (low) · minimax/minimax-m2.7@reasoning=low · 7%
146 · DeepSeek-R1 (none) · deepseek/deepseek-r1@reasoning=none · 7%
147 · prime-intellect/intellect-3 (low) · prime-intellect/intellect-3@reasoning=low · 7%
148 · mistralai/mistral-small-2603 (high) · mistralai/mistral-small-2603@reasoning=high · 6%
149 · qwen/qwen3-235b-a22b (none) · qwen/qwen3-235b-a22b@reasoning=none · 6%
150 · GLM-4.5 (none) · z-ai/glm-4.5@reasoning=none · 6%
151 · GPT-OSS 120B (high) · openai/gpt-oss-120b@reasoning=high · 5%
152 · nvidia/nemotron-nano-9b-v2:free (none) · nvidia/nemotron-nano-9b-v2:free@reasoning=none · 5%
153 · prime-intellect/intellect-3 (high) · prime-intellect/intellect-3@reasoning=high · 5%
154 · ai21/jamba-large-1.7 · ai21/jamba-large-1.7@reasoning=default · 5%
155 · o4-mini (high) (high) · openai/o4-mini@reasoning=high · 4%
156 · baidu/ernie-4.5-300b-a47b · baidu/ernie-4.5-300b-a47b@reasoning=default · 4%
157 · deepseek/deepseek-chat · deepseek/deepseek-chat@reasoning=default · 4%
158 · mistralai/mistral-small-2603 (none) · mistralai/mistral-small-2603@reasoning=none · 4%
159 · baidu/ernie-4.5-vl-424b-a47b (none) · baidu/ernie-4.5-vl-424b-a47b@reasoning=none · 3%
160 · qwen/qwen3-235b-a22b (xhigh) · qwen/qwen3-235b-a22b@reasoning=xhigh · 3%
161 · nvidia/nemotron-nano-9b-v2:free (xhigh) · nvidia/nemotron-nano-9b-v2:free@reasoning=xhigh · 3%
162 · google/gemma-3-27b-it · google/gemma-3-27b-it@reasoning=default · 3%
163 · mistralai/mistral-large-2512 · mistralai/mistral-large-2512@reasoning=default · 2%
164 · openai/gpt-4o-mini-2024-07-18 · openai/gpt-4o-mini-2024-07-18@reasoning=default · 2%
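
Each row carries a variant ID of the form `vendor/model@reasoning=level`, so the same base model can appear several times. The sketch below shows one way to split those IDs and group scores by base model. The helper names `parse_variant` and `group_by_base` are hypothetical, the four sample rows are copied from the table, and the ID grammar is inferred from the rows rather than documented.

```python
from collections import defaultdict


def parse_variant(variant_id):
    """Split 'vendor/model@reasoning=level' into (base_model, reasoning_level)."""
    base, _, setting = variant_id.partition("@")
    key, _, level = setting.partition("=")
    if key != "reasoning" or not level:
        raise ValueError(f"unexpected variant id: {variant_id!r}")
    return base, level


def group_by_base(rows):
    """Map each base model to a {reasoning_level: score} dictionary."""
    grouped = defaultdict(dict)
    for variant_id, score in rows:
        base, level = parse_variant(variant_id)
        grouped[base][level] = score
    return dict(grouped)


# Sample rows taken from the table (clear pushback rate in percent).
rows = [
    ("anthropic/claude-opus-4.8@reasoning=none", 95),
    ("anthropic/claude-opus-4.8@reasoning=xhigh", 94),
    ("openai/gpt-5.4@reasoning=none", 48),
    ("openai/gpt-5.4@reasoning=xhigh", 42),
]
by_base = group_by_base(rows)
for base, levels in by_base.items():
    print(base, levels)
```

Grouping this way makes it easier to compare reasoning settings within one model instead of reading the flat ranking row by row.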

FAQ

What does BullshitBench v2 measure?

BullshitBench v2 tests whether AI models challenge nonsensical, ill-posed, or logically flawed prompts instead of confidently generating incorrect answers. Its headline metric is the clear pushback rate: the share of nonsense prompts on which a model pushes back rather than playing along.

Which model leads the published BullshitBench v2 snapshot?

Claude Opus 4.8 (none) currently leads the published BullshitBench v2 snapshot with a 95% clear pushback rate. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on BullshitBench v2?

164 model variants, covering 96 base models, are included in BenchLM's mirrored BullshitBench v2 snapshot, based on the public leaderboard captured on June 1, 2026 at 5:53 AM UTC.

Last updated: June 1, 2026 at 5:53 AM UTC · mirrored from the public benchmark leaderboard
