A benchmark for grounded reasoning over office-style documents, spreadsheets, charts, and business artifacts.
As of June 2, 2026, Claude Opus 4.8 leads the OfficeQA Pro leaderboard with 66.2%, followed by GPT-5.5 (54.1%) and GPT-5.4 (53.2%).
Claude Opus 4.8 (Anthropic): 66.2%
GPT-5.5 (OpenAI): 54.1%
GPT-5.4 (OpenAI): 53.2%
According to BenchLM.ai, Claude Opus 4.8 leads the OfficeQA Pro benchmark with a score of 66.2%, followed by GPT-5.5 (54.1%) and GPT-5.4 (53.2%). Scores are widely spread: the leader sits 12.1 percentage points ahead of second place, which makes this benchmark effective at differentiating model capabilities.
Five models have been evaluated on OfficeQA Pro. The benchmark falls in the Multimodal & Grounded category, which carries a 12% weight in BenchLM.ai's overall scoring system. Within that category, OfficeQA Pro contributes 30% of the category score, so strong performance here directly affects a model's overall ranking.
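To make the weighting concrete, here is a minimal sketch of how the two percentages might combine. It assumes the overall score is a simple linear weighted average, so the category weight and the benchmark's share multiply; the page does not state the exact formula, and the function names are illustrative only.

```python
# Weights as stated on the page. Multiplying them is an ASSUMPTION
# (linear weighted average); see the BenchLM methodology for the real policy.
CATEGORY_WEIGHT = 0.12  # Multimodal & Grounded share of the overall score
BENCHMARK_SHARE = 0.30  # OfficeQA Pro share of the category score


def effective_weight(category_weight: float, benchmark_share: float) -> float:
    """Fraction of the overall score attributable to one benchmark."""
    return category_weight * benchmark_share


def overall_points(score_pct: float) -> float:
    """Overall-score points contributed by a benchmark score (hypothetical helper)."""
    return score_pct * effective_weight(CATEGORY_WEIGHT, BENCHMARK_SHARE)


weight = effective_weight(CATEGORY_WEIGHT, BENCHMARK_SHARE)
print(f"Effective weight: {weight:.1%}")  # Effective weight: 3.6%

# Gap between first place (66.2%) and second place (54.1%), in overall points.
lead = overall_points(66.2) - overall_points(54.1)
print(f"Lead in overall points: {lead:.2f}")  # Lead in overall points: 0.44
```

Under this assumption, OfficeQA Pro accounts for about 3.6% of the overall score, so the 12.1-point lead on the benchmark translates to roughly 0.44 points overall.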
Year: 2026
Tasks: Document and spreadsheet tasks
Format: Grounded QA over office artifacts
Difficulty: Enterprise grounded reasoning
OfficeQA Pro is useful when choosing models for enterprise copilots because it measures whether they can reason correctly over real office content rather than generic chat prompts.
Version: OfficeQA Pro 2026
Refresh cadence: Quarterly
Staleness state: Current
Question availability: Public benchmark set
BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.