
Terminal-Bench Hard

A display-only Artificial Analysis coding metric for agentic coding and terminal use on a harder Terminal-Bench slice.

Benchmark score on Terminal-Bench Hard — June 2, 2026

BenchLM mirrors the published score view for Terminal-Bench Hard. GPT-5.5 leads the public snapshot at 60.6%, followed by Claude Opus 4.8 (58.3%) and GPT-5.4 (57.6%). BenchLM does not use these results to rank models overall.

117 models · Coding · Current · Display only · Updated June 2, 2026

The published Terminal-Bench Hard snapshot is tightly clustered at the top: GPT-5.5 sits at 60.6%, while the third row is only 3.0 points behind. The broader top-10 spread is 12.1 points, so the benchmark still separates strong models even when the leaders cluster.
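The two gaps quoted above can be reproduced from the top ten rows of the score table. The following is a minimal sketch: the `spread` helper is a hypothetical name introduced for illustration and is not part of BenchLM, and the scores are copied by hand from the published table.

```python
# Top-10 scores (percent), copied from the published Terminal-Bench Hard table.
top10 = [60.6, 58.3, 57.6, 54.5, 53.8, 53.0, 52.3, 51.5, 50.8, 48.5]


def spread(scores, k):
    """Gap, in percentage points, between the best and the k-th best score.

    Hypothetical helper for illustration only.
    """
    ranked = sorted(scores, reverse=True)
    # Round to one decimal to match the precision of the published scores.
    return round(ranked[0] - ranked[k - 1], 1)


top3_gap = spread(top10, 3)       # leader vs. third row
top10_spread = spread(top10, 10)  # leader vs. tenth row

print(f"Leader to third row: {top3_gap} points")
print(f"Top-10 spread: {top10_spread} points")
```

Rounding to one decimal place keeps the result at the same precision as the published scores and avoids floating-point noise in the subtraction.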

117 models have been evaluated on Terminal-Bench Hard. The benchmark belongs to the Coding category, which carries a 20% weight in BenchLM.ai's overall scoring system. Terminal-Bench Hard itself is displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.

About Terminal-Bench Hard

Year: 2026
Tasks: Agentic coding and terminal tasks
Format: Task success rate
Difficulty: Professional software engineering

BenchLM stores Terminal-Bench Hard separately from Terminal-Bench 2.0 because OpenRouter and Artificial Analysis publish it as a distinct benchmark card.

BenchLM freshness & provenance

Version: Terminal-Bench Hard 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (117 models)

Rank  Score
1     60.6%
2     58.3%
3     57.6%
4     54.5%
5     53.8%
6     53.0%
7     52.3%
8     51.5%
9     50.8%
10    48.5%
11    47.0%
12    47.0%
13    46.2%
14    46.2%
15    46.2%
16    45.5%
17    45.5%
18    43.9%
19    43.9%
20    43.9%
21    43.2%
22    43.2%
23    43.2%
24    42.4%
25    41.7%
26    41.7%
27    40.9%
28    40.9%
29    40.9%
30    40.9%
31    39.4%
32    38.6%
33    37.9%
34    37.9%
35    37.9%
36    37.1%
37    37.1%
38    36.4%
39    35.6%
40    35.6%
41    35.6%
42    34.8%
43    34.8%
44    34.8%
45    34.8%
46    34.8%
47    34.8%
48    34.8%
49    34.3%
50    34.1%
51    33.3%
52    33.3%
53    32.6%
54    32.6%
55    32.6%
56    32.6%
57    31.8%
58    31.8%
59    31.1%
60    28.8%
61    27.3%
62    26.5%
63    26.5%
64    25.8%
65    25.0%
66    25.0%
67    24.2%
68    24.2%
70    23.5%
71    22.7%
72    22.7%
73    22.7%
74    21.2%
75    20.5%
76    20.5%
77    18.9%
78    17.4%
79    17.4%
80    17.4%
81    15.9%
82    15.9%
83    15.9%
84    14.4%
85    13.6%
86    13.6%
87    12.9%
88    12.1%
89    12.1%
90    10.6%
91    8.3%
93    8.3%
94    7.6%
95    6.8%
96    6.8%
97    6.8%
98    6.8%
99    6.1%
100   6.1%
101   4.5%
102   3.8%
103   3.8%
104   3.8%
105   3.8%
106   3.0%
107   2.3%
108   2.3%
109   1.5%
110   1.5%
111   1.5%
112   0.8%
113   0.0%
114   0.0%
115   0.0%
116   0.0%
117   0.0%

FAQ

What does Terminal-Bench Hard measure?

A display-only Artificial Analysis coding metric for agentic coding and terminal use on a harder Terminal-Bench slice.

Which model scores highest on Terminal-Bench Hard?

GPT-5.5 by OpenAI currently leads with a score of 60.6% on Terminal-Bench Hard.

How many models are evaluated on Terminal-Bench Hard?

117 AI models have been evaluated on Terminal-Bench Hard on BenchLM.

Last updated: June 2, 2026 · BenchLM version Terminal-Bench Hard 2026
