Tau2-Telecom

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Benchmark score on Tau2-Telecom — June 2, 2026

BenchLM mirrors the published score view for Tau2-Telecom. Step 3.7 Flash, GLM-5V-Turbo, and GLM-5-Turbo share the top of the public snapshot at 98.5% each, with Step 3.7 Flash listed first. BenchLM does not use these results to rank models overall.

117 models · Agentic · Current · Display only · Updated June 2, 2026

The published Tau2-Telecom snapshot is tightly clustered at the top: the first three rows are tied at 98.5%, and the top-10 spread is only 2.6 points (98.5% down to 95.9%), so the leading published scores sit in a narrow band.
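The gap and spread figures above can be rechecked from the published scores. The following is a minimal sketch, not BenchLM's own tooling: the `top10` list is copied by hand from the score table on this page, and the `gap` helper is a hypothetical name introduced only for illustration.

```python
# Top-10 published scores, copied by hand from the score table on this page.
top10 = [98.5, 98.5, 98.5, 98.2, 97.7, 97.7, 97.7, 96.2, 95.9, 95.9]


def gap(scores, rank):
    """Points between the first row and the given 1-based rank.

    Hypothetical helper for illustration; rounds to one decimal place
    to match the precision of the published scores.
    """
    return round(scores[0] - scores[rank - 1], 1)


third_row_gap = gap(top10, 3)   # leader vs. third row
top10_spread = gap(top10, 10)   # leader vs. tenth row

print(f"gap to rank 3: {third_row_gap:.1f} points")
print(f"top-10 spread: {top10_spread:.1f} points")
```

Rounding to one decimal matters here because the raw floating-point difference between 98.5 and 95.9 is not exactly 2.6.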

117 models have been evaluated on Tau2-Telecom. The benchmark falls in the Agentic category. This category carries a 22% weight in BenchLM.ai's overall scoring system. Tau2-Telecom is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.

About Tau2-Telecom

Year: 2026

Tasks: Telecom tool workflows

Format: Domain-specific tool evaluation

Difficulty: Professional workflow

OpenAI reports tau2-bench as a domain-specific tool benchmark for telecom tasks, useful for measuring API-call reliability under constraints.

BenchLM freshness & provenance

Version: τ²-Bench 2026

Refresh cadence: Quarterly

Staleness state: Current

Question availability: Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

Benchmark score table (117 models)

Rank  Score
1     98.5%
2     98.5%
3     98.5%
4     98.2%
5     97.7%
6     97.7%
7     97.7%
8     96.2%
9     95.9%
10    95.9%
11    95.9%
12    95.9%
13    95.9%
14    95.6%
15    95.6%
16    95.6%
17    95.3%
18    95.3%
19    95.0%
20    95.0%
21    94.7%
22    94.4%
23    94.2%
24    94.2%
25    94.2%
26    94.2%
27    93.9%
28    93.9%
29    93.6%
31    92.7%
32    92.1%
33    92.1%
34    91.5%
35    91.2%
36    90.1%
37    90.1%
38    89.5%
39    89.2%
40    88.6%
41    87.1%
42    87.1%
43    86.5%
44    86.3%
45    86.0%
46    86.0%
47    84.8%
48    84.8%
49    84.8%
50    84.8%
51    83.9%
52    83.9%
53    83.3%
54    83.0%
55    83.0%
56    81.9%
57    80.7%
58    80.7%
59    79.5%
60    78.9%
61    76.9%
62    76.0%
63    75.7%
64    74.9%
65    74.3%
66    74.3%
67    74.0%
68    71.4%
69    65.8%
70    65.8%
71    63.7%
72    62.6%
73    61.1%
74    60.2%
75    59.9%
76    54.1%
77    52.9%
78    52.3%
79    47.1%
80    46.8%
81    46.5%
83    43.6%
84    43.3%
85    41.2%
86    41.2%
87    37.4%
88    36.5%
89    34.8%
90    34.5%
91    31.9%
92    31.3%
93    30.7%
94    28.7%
95    25.4%
96    25.1%
97    24.6%
98    24.3%
99    22.8%
100   22.8%
101   21.1%
102   20.8%
103   20.8%
104   20.5%
105   19.6%
106   19.0%
107   17.8%
108   17.3%
109   15.5%
110   14.9%
111   14.6%
112   14.0%
113   13.2%
114   11.4%
115   10.5%
116   4.1%
117   0.0%

FAQ

What does Tau2-Telecom measure?

A telecom-oriented tool benchmark that measures structured tool use in domain workflows.

Which model scores highest on Tau2-Telecom?

Step 3.7 Flash by StepFun is listed first with a score of 98.5% on Tau2-Telecom, tied with GLM-5V-Turbo and GLM-5-Turbo at the same score.

How many models are evaluated on Tau2-Telecom?

BenchLM lists Tau2-Telecom scores for 117 AI models.

Last updated: June 2, 2026 · BenchLM version τ²-Bench 2026
