Model profile

Grok 4.20

Name: Grok 4.20
Author: xAI

xAI · Current · Released Mar 10, 2026

Data verified July 23, 2026

Superseded: xAI has released newer models in this line: Grok 4.3 · Grok 4.5

Grok 4.20 by xAI scores 54.68/100 on the public leaderboard (#88 of 200), with an Arena Elo of 1474 and a 2M-token context window. API pricing is $2/$6 per million input/output tokens. Newer replacements: Grok 4.3 and Grok 4.5.

Overall Score

54.68 · Public #88 of 200

Arena Elo

1474

Eligible category ranks

3 of 8

Price (1M tokens)

$2 in / $6 out

API pricing
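
As a rough illustration of the listed API pricing, the sketch below estimates the cost of a single request at $2 per million input tokens and $6 per million output tokens. The helper name and the example token counts are hypothetical; only the two prices come from this profile.

```python
# Prices from this profile, in USD per 1M tokens.
INPUT_PRICE_PER_M = 2.0
OUTPUT_PRICE_PER_M = 6.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request (hypothetical helper, not an xAI API)."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Example sizes are made up: a 100,000-token prompt with a 2,000-token reply.
cost = estimate_cost_usd(100_000, 2_000)
print(f"${cost:.3f}")  # $0.212
```

Because output tokens cost three times as much as input tokens, long replies tend to dominate the bill even when prompts are large.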

Speed

233 tok/s

Context

2M tokens

Evidence coverage

18 of 323 tracked benchmarks are published. 2 are verified and 16 provisional. 5 of 8 categories are measured.

Updated July 23, 2026

Published / tracked: 18 / 323
Verified: 2
Provisional: 16
Categories with evidence: 5 / 8

Agentic: 3 benchmarks · Mixed evidence
Coding: 4 benchmarks · Mixed evidence
Reasoning: 1 benchmark · Reported
Knowledge: 4 benchmarks · Reported
Math: 0 benchmarks · Not measured
Multilingual: 0 benchmarks · Not measured
Multimodal: 6 benchmarks · Reported
Inst. Following: 0 benchmarks · Not measured
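
The coverage figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the counts shown on this page; the variable names are illustrative and not part of any BenchLM API.

```python
# Benchmark counts per category, as listed in this profile.
benchmarks_by_category = {
    "Agentic": 3, "Coding": 4, "Reasoning": 1, "Knowledge": 4,
    "Math": 0, "Multilingual": 0, "Multimodal": 6, "Inst. Following": 0,
}
TRACKED = 323                    # benchmarks BenchLM tracks
VERIFIED, PROVISIONAL = 2, 16    # evidence status counts from this page

published = sum(benchmarks_by_category.values())
measured = sum(1 for n in benchmarks_by_category.values() if n > 0)
coverage_pct = 100 * published / TRACKED

# The per-category counts should add up to verified + provisional.
assert published == VERIFIED + PROVISIONAL

print(f"{published}/{TRACKED} published ({coverage_pct:.1f}%)")
print(f"{measured}/{len(benchmarks_by_category)} categories measured")
```

The counts are internally consistent: 18 published rows, about 5.6% of the tracked set, spread over 5 of 8 categories.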

Proprietary · Reasoning

Confidence: Medium

Grok 4.20 ranks #88 out of 200 models on the public leaderboard with an overall score of 54.68/100. It does not yet have enough sourced coverage for BenchLM's verified leaderboard. It is not a frontier model: in the three categories where it is ranked, it sits between the 14th and 59th percentile of the eligible cohort.

Grok 4.20 is a proprietary model with a 2M token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.
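
To make the latency and token trade-off concrete, the sketch below estimates generation time and output cost for a reply with and without extra reasoning tokens. It uses the 233 tok/s speed and the $6 per million output tokens listed on this page. It assumes that reasoning tokens are generated at the same speed and billed like ordinary output tokens; that assumption and the token counts are illustrative and not confirmed by the source.

```python
SPEED_TOK_PER_S = 233.0      # speed listed on this profile
OUTPUT_PRICE_PER_M = 6.0     # USD per 1M output tokens, from this profile


def generation_estimate(answer_tokens: int, reasoning_tokens: int = 0):
    """Return (seconds, usd) for one reply.

    Assumption: reasoning tokens are produced at the same speed and
    billed like ordinary output tokens.
    """
    total = answer_tokens + reasoning_tokens
    seconds = total / SPEED_TOK_PER_S
    usd = total * OUTPUT_PRICE_PER_M / 1_000_000
    return seconds, usd


# Hypothetical reply: 500 visible tokens, with and without 4,000 reasoning tokens.
for reasoning in (0, 4_000):
    secs, usd = generation_estimate(500, reasoning)
    print(f"reasoning={reasoning}: {secs:.1f} s, ${usd:.4f}")
```

Under these assumptions, the reasoning tokens account for most of both the wait and the output cost of a short answer.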

Grok 4.20 sits inside the Grok 4.20 family alongside Grok 4.20 Multi-agent. BenchLM links it directly to Grok 4.1 as the earlier related model in that lineage. This profile currently has 18 of 323 tracked benchmarks. BenchLM only exposes non-generated benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.

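By rank number alone, its best category is Multimodal & Grounded (#25) and its worst is Coding (#83), but the cohorts differ in size. #25 of 29 is only the 14th percentile, while Agentic (#49 of 119) is the 59th percentile and Coding (#83 of 122) the 32nd. Relative to its cohort, Grok 4.20 is therefore strongest in Agentic tasks and weakest in Multimodal & Grounded, so the multimodal rank by itself does not make it a strong choice for screenshots, documents, charts, and grounded multimodal workflows.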

Peer position

Exact provisional scores and ranks for the closest listed peers. A score can appear before a model clears the evidence threshold for a rank, so equal scores can have different rank states.

Range 54.48–55.15

Rank	Model	Author	Score
#84	GPT-5 (medium)	OpenAI	55.15
#85	GLM-4.6	Z.AI	55.12
#86	Step 3.5 Flash	StepFun	55.1
#87	Kimi K2.7 Code	Moonshot AI	55.0
#88	Grok 4.20 (current model)	xAI	54.68
#89	DeepSeek LLM 2.0	DeepSeek	54.53
#90	GPT-5.1-Codex-Max	OpenAI	54.48

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Multimodal: 14% · Eligible cohort rank #25 of 29 · Category score 34.9
Agentic: 59% · Eligible cohort rank #49 of 119 · Category score 49.6
Coding: 32% · Eligible cohort rank #83 of 122 · Category score 47.0
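
The page does not spell out how the percentile is derived from the cohort rank. The sketch below uses an assumed formula, the share of the cohort ranked below the model, which reproduces all three listed values. Other formulas, such as dividing by the cohort size minus one, also fit these rows, so treat this as one plausible reading and not as BenchLM's documented method.

```python
def cohort_percentile(rank: int, cohort_size: int) -> int:
    """Share of the cohort ranked below this model, as a rounded percentage.

    Assumed formula (inferred from the three rows on this page):
    (cohort_size - rank) / cohort_size * 100
    """
    return round(100 * (cohort_size - rank) / cohort_size)


# (rank, cohort size) pairs as listed in this profile.
rows = {"Multimodal": (25, 29), "Agentic": (49, 119), "Coding": (83, 122)}
for name, (rank, size) in rows.items():
    print(name, cohort_percentile(rank, size))
# Multimodal 14
# Agentic 59
# Coding 32
```

This shows why a small rank number is not automatically a strong result: #25 in a cohort of 29 sits near the bottom, while #49 in a cohort of 119 is above the middle.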

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	49.6	#49 of 119	59th	22%	3 benchmarks	Mixed sources
Coding	47.0	#83 of 122	32nd	20%	4 benchmarks	Mixed sources
Reasoning	65.2	Not ranked	Not available	17%	1 benchmark	Reported
Knowledge	Score pending	Not ranked	Not available	12%	4 benchmarks	Reported
Math	Not measured	Not ranked	Not available	5%	0 benchmarks	Not measured
Multilingual	Not measured	Not ranked	Not available	7%	0 benchmarks	Not measured
Multimodal	34.9	#25 of 29	14th	12%	6 benchmarks	Reported
Inst. Following	Not measured	Not ranked	Not available	5%	0 benchmarks	Not measured

Chatbot Arena performance

Chatbot Arena Elo, confidence interval, and vote count by evaluation view
View	Elo	Confidence interval	Votes
Text Overall	1474	±4.7	26,827
Coding	1508	±7.7	7,213
Math	1452	±14.8	1,601
Instruction Following	1449	±7.2	8,360
Creative Writing	1460	±10.0	3,970
Multi-turn	1482	±9.6	4,384
Hard Prompts	1486	±5.6	16,642
Hard Prompts (English)	1487	±7.2	8,378
Longer Query	1463	±7.0	9,785
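
For readers less familiar with Elo, a rating gap can be translated into an expected head-to-head win rate with the standard Elo logistic formula. The sketch below applies it to the Text Overall rating against a hypothetical opponent rated 1500, and checks whether two views differ by more than their stated intervals. Treating the ± values as simple symmetric ranges is an assumption about how the Arena reports them, and the opponent rating is invented for illustration.

```python
def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def intervals_overlap(elo_a: float, ci_a: float,
                      elo_b: float, ci_b: float) -> bool:
    """True if the ranges [elo - ci, elo + ci] intersect."""
    return abs(elo_a - elo_b) <= ci_a + ci_b


# Text Overall (1474) against a hypothetical opponent rated 1500.
print(round(expected_win_rate(1474, 1500), 2))

# Views from the table above, as (Elo, confidence interval).
coding_view = (1508, 7.7)
math_view = (1452, 14.8)
hard_prompts = (1486, 5.6)
hard_prompts_en = (1487, 7.2)

print(intervals_overlap(*coding_view, *math_view))          # False
print(intervals_overlap(*hard_prompts, *hard_prompts_en))   # True
```

Under this reading, the Coding and Math views are separated by more than their combined intervals, while the two Hard Prompts views are statistically indistinguishable.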

Benchmark Details

Rows below have a displayable published verification record. Source-unverified manual rows and generated rows stay hidden.

Agentic (3 benchmarks)

Terminal-Bench 2.0 · Secondary exact
47.1% · Weighted 38%
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

DeepSearchQA · Secondary exact
62.8% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

Gert Labs · Benchmark exact
Gert Labs Composite Game Benchmark
38.36% · Display only
Source: Gert Labs rankings
Provenance: Gert Labs reports this composite leaderboard score in the public rankings API. BenchLM scales the source gscore from 0-1 to 0-100 and stores it as a display-only agentic benchmark.

Coding (4 benchmarks)

SWE-bench Verified · Secondary exact
Software Engineering Benchmark Verified
76.7% · Weighted 16%
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

SWE-bench Pro · Secondary exact
51.8% · Weighted 10%
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

LiveCodeBench Pro · Secondary exact
74.2% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

Vibe Code Bench · Benchmark exact
Vibe Code Bench v1.1
4.06% · Display only
Source: Vals AI: Vibe Code Bench v1.1
Provenance: Vals Vibe Code Bench v1.1 reports this exact row under grok/grok-4.20-0309-reasoning; BenchLM stores it on the local vibeCodeBench key.

Reasoning (1 benchmark)

ARC-AGI-2 · Secondary exact
Abstraction and Reasoning Corpus for AGI v2
53.3% · Weighted 31%
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

Knowledge (4 benchmarks)

GPQA-D · Secondary exact
GPQA Diamond
88.5% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: BenchLM maps Meta’s “Grok 4.2 Reasoning” chart row onto the tracked Grok 4.20 reasoning family entry.

HLE w/o tools · Secondary exact
Humanity's Last Exam without tools
31.6% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: BenchLM maps Meta’s “Grok 4.2 Reasoning” chart row onto the tracked Grok 4.20 reasoning family entry.

HealthBench Hard · Secondary exact
20.3% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

MedXpertQA (Text) · Secondary exact
MedXpertQA Text
50.2% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

Multimodal (6 benchmarks)

MMMU-Pro · Secondary exact
Massive Multi-discipline Multimodal Understanding Pro
75.2% · Weighted 45%
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

CharXiv · Secondary exact
CharXiv Reasoning
60.9% · Weighted 25%
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

ERQA · Secondary exact
54.1% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

SimpleVQA · Secondary exact
57.4% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

MedXpertQA (MM) · Secondary exact
MedXpertQA Multimodal
65.8% · Display only
Source: Meta AI: Muse Spark comparison chart
Provenance: Secondary exact

Design Arena Website · Reported
Design Arena Website Elo
1257 · Display only
Source: OpenRouter model benchmarks
Provenance: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.

Grok 4.20 Family

Reasoning

Related earlier model: Grok 4.1

Related variant: Grok 4.20 Multi-agent

Frequently Asked Questions

How does Grok 4.20 perform overall in AI benchmarks?

Grok 4.20 ranks #88 of 200 on the public leaderboard with an overall score of 54.68/100, based on 18 published benchmark scores on BenchLM. It does not yet have enough sourced coverage for a rank on BenchLM's verified leaderboard.

Is Grok 4.20 good for knowledge and understanding?

Grok 4.20 has four reported knowledge benchmarks on BenchLM, but its category score is still pending and BenchLM does not currently assign it a category rank there.

Is Grok 4.20 good for coding and programming?

Grok 4.20 ranks #83 out of 122 models in coding and programming benchmarks with an average score of 47. There are stronger options in this category.

Is Grok 4.20 good for reasoning and logic?

Grok 4.20 has one reported reasoning benchmark (ARC-AGI-2, 53.3%) and a category score of 65.2, but BenchLM does not currently assign it a category rank there.

Is Grok 4.20 good for agentic tool use and computer tasks?

Grok 4.20 ranks #49 out of 119 models in agentic tool use and computer tasks benchmarks with an average score of 49.6. There are stronger options in this category.

Is Grok 4.20 good for multimodal and grounded tasks?

Grok 4.20 ranks #25 out of 29 models in multimodal and grounded tasks benchmarks with an average score of 34.9. There are stronger options in this category.

Which sibling models are related to Grok 4.20?

Grok 4.20 belongs to the Grok 4.20 family. Related variants on BenchLM include Grok 4.20 Multi-agent.

Does Grok 4.20 have full benchmark coverage on BenchLM?

Not yet. Grok 4.20 currently has 18 published benchmark scores out of the 323 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Grok 4.20?

Grok 4.20 has a published context window of 2M tokens, which determines how much text it can process in a single interaction.

Last updated: July 23, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
