Benchmark profile

VITA-Bench

An interactive real-world agent benchmark grounded in practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.

Data verified July 23, 2026

Benchmark score on VITA-Bench — July 23, 2026

BenchLM mirrors the published score view for VITA-Bench. Qwen3.7 Max leads the public snapshot at 47.9%, followed by Qwen3.7 Plus (45.6%) and Qwen3.6 Plus (44.3%). BenchLM does not use these results to rank models overall.

1 · Closed

Qwen3.7 Max

Alibaba

qwen3-7-max

47.9%

Overall 72.84 · Context 1M

2 · Closed

Qwen3.7 Plus

Alibaba

qwen3-7-plus

45.6%

Overall 67.22 · Context 1M

3 · Closed

Qwen3.6 Plus

Alibaba

qwen3-6-plus

44.3%

Overall 65.2 · Context 1M

10 models · Agentic · Current · Display only · Updated July 23, 2026

Benchmark score table (10 models)

Model · Provider · License · Score

Qwen3.7 Max · Alibaba · Closed · 47.9%
Qwen3.7 Plus · Alibaba · Closed · 45.6%
Qwen3.6 Plus · Alibaba · Closed · 44.3%
Qwen3.5 397B · Alibaba · Open weight · 43.7%
Agents-A1 · InternScience · Open weight · 38.8%
Qwen3.6-35B-A3B · Alibaba · Open weight · 35.6%
Claude Opus 4.5 · Anthropic · Closed · 23.3%
DeepSeek V3.2 · DeepSeek · Open weight · 18.5%
Claude Sonnet 4.5 · Anthropic · Closed · 17.0%
GLM-4.7 · Z.AI · Open weight · 15.5%

The published VITA-Bench snapshot places Qwen3.7 Max first at 47.9%. Third-place Qwen3.6 Plus (44.3%) trails it by 3.6 points. Across all 10 listed models, scores span 32.4 points, from 47.9% down to 15.5% for GLM-4.7, so the table still separates the published systems.
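The gap and range figures can be rechecked directly from the score table. The following is a minimal Python sketch, assuming the ten published scores are copied by hand into a dictionary; the variable names are illustrative and are not part of any BenchLM tooling.

```python
# Scores (percent) copied by hand from the published VITA-Bench table.
# The names below are illustrative, not a BenchLM API.
scores = {
    "Qwen3.7 Max": 47.9,
    "Qwen3.7 Plus": 45.6,
    "Qwen3.6 Plus": 44.3,
    "Qwen3.5 397B": 43.7,
    "Agents-A1": 38.8,
    "Qwen3.6-35B-A3B": 35.6,
    "Claude Opus 4.5": 23.3,
    "DeepSeek V3.2": 18.5,
    "Claude Sonnet 4.5": 17.0,
    "GLM-4.7": 15.5,
}

# Rank models from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

leader_name, leader_score = ranked[0]
third_name, third_score = ranked[2]
last_name, last_score = ranked[-1]

# Round to one decimal to avoid floating-point noise in the differences.
gap_to_third = round(leader_score - third_score, 1)
top10_range = round(leader_score - last_score, 1)

print(f"{leader_name} leads {third_name} by {gap_to_third} points")
print(f"Range from {leader_name} to {last_name}: {top10_range} points")
```

Rounding to one decimal place matches the precision of the published scores.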

Ten models have been evaluated on VITA-Bench, which falls in the Agentic category. That category carries a 22% weight in BenchLM.ai's overall scoring system. VITA-Bench itself is currently displayed for reference only and is excluded from the scoring formula, so it does not directly affect overall rankings.

About VITA-Bench

Year

2025

Tasks

Interactive consumer-service agent tasks

Format

End-to-end interactive agent evaluation

Difficulty

Long-horizon real-world workflows

VITA-Bench is built to test realistic interactive agent behavior rather than toy tool calls. It stresses long-horizon coordination, tool selection, changing user intent, and domain switching across daily-life applications.

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

BenchLM freshness & provenance

Version

VITA-Bench 2025

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does VITA-Bench measure?

VITA-Bench measures interactive, real-world agent performance on practical consumer-service tasks such as delivery, in-store consumption, and online travel workflows.

Which model scores highest on VITA-Bench?

Qwen3.7 Max by Alibaba currently leads with a score of 47.9% on VITA-Bench.

How many models are evaluated on VITA-Bench?

BenchLM currently lists VITA-Bench scores for 10 AI models.

Last updated: July 23, 2026 · BenchLM version VITA-Bench 2025
