
Best LLM for Coding
in 2026

Benchmarks, not vibes. We pulled scores from SWE-bench Verified, Aider's polyglot leaderboard, and LiveCodeBench, then cross-referenced pricing from every major API. Here is what actually wins.

March 2026 · 20 min read
SWE-bench Verified — top five (full results below)

1. Claude Opus 4.5 (Anthropic) — 80.9%
2. Claude Opus 4.6 (Anthropic) — 80.8%
3. Gemini 3.1 Pro (Google) — 80.6%
4. MiniMax M2.5 (MiniMax) — 80.2%
5. GPT-5.2 (OpenAI) — 80.0%

Source: swebench.com, March 2026

Six months ago, GPT-4o was the default for most developers. That world is gone. In March 2026, there are at least five models scoring above 80% on SWE-bench Verified, context windows have converged at 1 million tokens, and DeepSeek is delivering frontier-level coding performance at 1/20th the price of Claude or GPT.

This article compares every major LLM on the benchmarks that actually matter for coding: SWE-bench (real GitHub issue resolution), Aider's polyglot benchmark (multi-language coding tasks), and LiveCodeBench (competitive programming). We include pricing for every model so you can calculate your own cost-performance tradeoff.
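As a starting point for that tradeoff math, here is a minimal score-per-dollar sketch in Python. The scores and prices are the ones quoted later in this article; the 30K-input/3K-output token volumes per task are illustrative assumptions, not measurements — substitute your own workload.

```python
# Illustrative cost-performance calculator. Scores (SWE-bench Verified)
# and per-million-token prices are taken from the tables in this article.
MODELS = {
    # name: (swe_bench_pct, input_usd_per_mtok, output_usd_per_mtok)
    "Claude Opus 4.6":   (80.8, 5.00, 25.00),
    "Claude Sonnet 4.6": (79.6, 3.00, 15.00),
    "MiniMax M2.5":      (80.2, 0.30, 1.20),
    "DeepSeek V3.2":     (73.1, 0.28, 0.42),
}

def cost_per_task(input_price, output_price,
                  input_tokens=30_000, output_tokens=3_000):
    """Rough dollar cost of one coding task at assumed token volumes."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

def score_per_dollar(score, input_price, output_price):
    """Benchmark points per dollar spent on a typical task."""
    return score / cost_per_task(input_price, output_price)

# Rank models by value rather than raw score.
for name, (score, inp, out) in sorted(
        MODELS.items(), key=lambda kv: -score_per_dollar(*kv[1])):
    print(f"{name:18s} {score:5.1f}%  "
          f"${cost_per_task(inp, out):.4f}/task  "
          f"{score_per_dollar(score, inp, out):10.1f} pts/$")
```

On these assumed volumes, DeepSeek's value ranking dominates despite its lower raw score, which is the theme of the rest of this article.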

SWE-bench Verified: the gold standard

SWE-bench tests whether a model can actually resolve real GitHub issues from popular open-source repositories. It is the closest thing we have to measuring "can this model do real software engineering work." The Verified subset filters for high-quality issues where the fix is unambiguous.

SWE-bench Verified is the most widely cited benchmark for evaluating LLMs on real-world software engineering tasks.

Source: SWE-bench Official Leaderboard

1. Claude Opus 4.5 — 80.9%
2. Claude Opus 4.6 — 80.8%
3. Gemini 3.1 Pro — 80.6%
4. MiniMax M2.5 — 80.2%
5. GPT-5.2 — 80.0%
6. Claude Sonnet 4.6 — 79.6%
7. Gemini 3 Flash — 78.0%
8. GPT-5 — 74.9%
9. DeepSeek V3.2 — 73.1%
10. Claude Sonnet 4 — 72.7%
11. o3 — 69.1%
12. GPT-4.1 — 54.6%

The top five models are separated by less than one percentage point. Claude Opus 4.5 leads at 80.9%, followed by Opus 4.6 at 80.8%, Gemini 3.1 Pro at 80.6%, MiniMax M2.5 at 80.2%, and GPT-5.2 at 80.0%. At this level, the scaffolding and prompting strategy matter more than the raw model.

The real story is the gap between the frontier and the previous generation. GPT-4.1 scores just 54.6%. Claude 3.5 Sonnet, the darling of late 2024, manages 49.0%. GPT-4o sits at 33.2%. The improvement in one year is staggering.

Important caveat: OpenAI has flagged training data contamination concerns across all frontier models on SWE-bench Verified. The newer SWE-Bench Pro benchmark from Scale AI was designed to address this. On that benchmark, GPT-5.4 leads at 57.7%, Gemini 3.1 Pro at 54.2%, and Claude Opus 4.5 at 45.9%. The gaps are wider and may be more representative of true capability.

Aider Polyglot: real multi-language coding

Aider's benchmark tests models on 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust. Unlike SWE-bench, it measures raw code generation ability across multiple languages, not just Python-heavy GitHub issues. It also reports the cost per run, which makes price-performance comparisons trivial.

Aider's polyglot leaderboard tests 225 exercises across 6 languages with full cost tracking.

Source: Aider LLM Leaderboards

1. GPT-5 (high) — 88.0%
2. GPT-5 (med) — 86.7%
3. o3-pro (high) — 84.9%
4. Gemini 2.5 Pro — 83.1%
5. GPT-5 (low) — 81.3%
6. Grok 4 (high) — 79.6%
7. o3 — 76.9%
8. DeepSeek V3.2 — 74.2%
9. Claude Opus 4 — 72.0%
10. o4-mini (high) — 72.0%
11. DeepSeek R1 — 71.4%
12. DeepSeek V3.2 Chat — 70.2%

GPT-5 with high reasoning dominates at 88.0%, but at $29.08 per run. The standout is DeepSeek V3.2 at 74.2% for just $1.30 per run. That is 84% of GPT-5's score at 4.5% of the cost. If you are doing high-volume code generation, DeepSeek is the obvious choice.
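The "84% of the score at 4.5% of the cost" claim is just a ratio of the leaderboard numbers quoted above; a two-line check in Python:

```python
# Back-of-envelope check of the value claim, using the Aider polyglot
# scores and per-run costs quoted in this article.
gpt5_score, gpt5_cost = 88.0, 29.08        # GPT-5 (high)
deepseek_score, deepseek_cost = 74.2, 1.30  # DeepSeek V3.2

relative_score = deepseek_score / gpt5_score  # ~0.84
relative_cost = deepseek_cost / gpt5_cost     # ~0.045

print(f"DeepSeek delivers {relative_score:.0%} of GPT-5's score "
      f"at {relative_cost:.1%} of the cost")
```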

Claude Opus 4 scores 72.0% at $65.75 per run, which is expensive relative to its score. However, this benchmark does not measure codebase understanding, context utilization, or multi-file editing, where Claude's large context window and architecture tend to excel.

Pricing: the full picture

Prices have dropped dramatically. Claude Opus went from $15/$75 per million tokens (Opus 4) to $5/$25 (Opus 4.6). GPT-5.4 at $2.50/$15 is cheaper than GPT-4o was at launch. DeepSeek V4 is practically free at $0.30/$0.50.

API Pricing (per 1M tokens, USD)

Sources: official API pricing pages, March 2026

Model             | Company   | Input | Output | Context
Claude Opus 4.6   | Anthropic | $5.00 | $25.00 | 1M
Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 1M
Claude Haiku 4.5  | Anthropic | $1.00 | $5.00  | 200K
GPT-5.4           | OpenAI    | $2.50 | $15.00 | 272K
GPT-4.1           | OpenAI    | $2.00 | $8.00  | 1M
GPT-4.1 Mini      | OpenAI    | $0.40 | $1.60  | 1M
Gemini 3.1 Pro    | Google    | $2.00 | $12.00 | 1M
Gemini 2.5 Flash  | Google    | $0.30 | $2.50  | 1M
DeepSeek V4       | DeepSeek  | $0.30 | $0.50  | 128K
DeepSeek R1       | DeepSeek  | $0.55 | $2.19  | 128K
Grok 4 Fast       | xAI       | $0.20 | $0.50  | 2M
Llama 4 Maverick  | Meta      | $0.15 | $0.60  | 1M
MiniMax M2.5      | MiniMax   | $0.30 | $1.20  | 205K

The pricing gap is dramatic. DeepSeek V4 output costs $0.50 per million tokens. Claude Opus 4.6 output costs $25.00. That is a 50x difference. For bulk tasks like code review, migration, or test generation, the cheaper models deliver 80-90% of the quality at a fraction of the cost.
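To make the 50x gap concrete, here is a sketch of what a bulk pipeline's monthly output-token bill looks like at the prices in the table above. The 200M-token monthly volume is a made-up workload for illustration.

```python
# Sketch: monthly output-token bill for a hypothetical bulk pipeline
# (code review, migration, test generation), using the per-million-token
# output prices from the pricing table in this article.
OUTPUT_PRICE = {          # USD per 1M output tokens
    "DeepSeek V4":     0.50,
    "MiniMax M2.5":    1.20,
    "Claude Opus 4.6": 25.00,
}

monthly_output_tokens = 200_000_000  # assumed workload, not a measurement

for model, price in OUTPUT_PRICE.items():
    bill = monthly_output_tokens / 1e6 * price
    print(f"{model:16s} ${bill:>9,.2f}/month")

ratio = OUTPUT_PRICE["Claude Opus 4.6"] / OUTPUT_PRICE["DeepSeek V4"]
print(f"Opus output costs {ratio:.0f}x DeepSeek's")
```

At this volume the same workload runs $100/month on DeepSeek V4 versus $5,000/month on Claude Opus 4.6, which is why "80-90% of the quality" is often the right trade for bulk tasks.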

Our picks by use case

Best overall for coding: Claude Opus 4.6. Top SWE-bench score, 1M context window, powers Cursor and Claude Code. The $5/$25 pricing makes it accessible for daily use.

Best for raw problem-solving: GPT-5.4 with high reasoning. Leads on Aider polyglot (88%) and SWE-Bench Pro (57.7%). The reasoning mode lets it tackle harder algorithmic problems.

Best price-performance: DeepSeek V3.2. Scores 73-74% on major benchmarks at $0.28/$0.42 per million tokens. If you need an LLM for high-volume coding tasks, this is the answer.

Best budget frontier: MiniMax M2.5. 80.2% on SWE-bench Verified at $0.30/$1.20. Essentially frontier performance at mid-tier pricing. The dark horse of 2026.

Best free/open-weight: Llama 4 Maverick. Meta's open-weight model at $0.15/$0.60 (or free if you self-host). 1M context window. Not frontier-level on benchmarks, but good enough for many tasks and you own the model.

Best for speed-sensitive workflows: Gemini 2.5 Flash. ~250 tokens/second output, 1M context, $0.30/$2.50. When latency matters more than benchmark scores.

The contamination problem

A word of caution about all these benchmarks. OpenAI has publicly raised concerns about training data contamination on SWE-bench Verified. When models train on data that overlaps with test cases, scores inflate. The SWE-Bench Pro benchmark from Scale AI attempts to address this with fresher, less-contaminated issues. On that benchmark, the gaps between models are much wider, and the absolute scores are much lower.

SWE-Bench Pro uses fresher issues to reduce contamination: GPT-5.4 leads at 57.7% vs Claude Opus 4.5 at 45.9%.

Source: Scale AI SEAL Leaderboard

The takeaway: treat SWE-bench Verified as a useful signal, not gospel. The models at the top are genuinely good at coding, but the exact percentages should be taken with a grain of salt. Real-world performance depends on your specific codebase, language, and workflow.

The bottom line

The LLM coding landscape in March 2026 is defined by convergence at the top and divergence in pricing. Five models score within 1% of each other on SWE-bench Verified. The real differentiator is not raw benchmark scores but price, speed, context window, and ecosystem integration.

If money is no object, Claude Opus 4.6 and GPT-5.4 are the strongest picks. If you care about cost, DeepSeek V3.2 and MiniMax M2.5 deliver shocking value. If you want open weights, Llama 4 is the only serious option that comes close to the frontier.

In six months, these rankings will be different. The pace of improvement shows no signs of slowing down.

LLMs write code.
Interviews test if you can.

AI can help you practice, but the interview is still you, a whiteboard, and a timer. Practice with an AI interviewer that simulates the real thing.

Try a free mock interview →

Frequently Asked Questions

What is the best free LLM for coding?

DeepSeek V3.2 and Llama 4 Maverick are the strongest free/cheap options. DeepSeek scores 73.1% on SWE-bench Verified and 74.2% on Aider's polyglot benchmark at a fraction of the cost of frontier models. Llama 4 is open-weight and can run locally.

Is Claude or GPT better for coding?

On SWE-bench Verified, Claude Opus 4.5/4.6 (80.8-80.9%) slightly edges GPT-5.2 (80.0%). On Aider's polyglot benchmark, GPT-5 scores higher (88%) than Claude Opus 4 (72%). The answer depends on the task: Claude tends to excel at understanding large codebases, GPT-5 at raw problem-solving.

What LLM do professional developers actually use?

Claude powers Cursor and Claude Code, the two most popular AI coding tools. GPT-4.1/5.x powers GitHub Copilot. Gemini powers Android Studio and Google's internal tools. In practice, most professional developers use whichever model their IDE integrates with.

Is Gemini good for coding?

Yes. Gemini 3.1 Pro scores 80.6% on SWE-bench Verified, virtually tied with Claude and GPT. Gemini 2.5 Flash is particularly compelling for cost-sensitive use at $0.30/$2.50 per million tokens with a 1M context window.

What LLM has the best price-to-performance for coding?

DeepSeek V3.2 at $0.28/$0.42 per million tokens. It scores 73.1% on SWE-bench Verified and 74.2% on Aider's polyglot benchmark. That is within 10% of frontier models at 1/20th the price. MiniMax M2.5 at $0.30/$1.20 is another standout, scoring 80.2% on SWE-bench.

Does context window size matter for coding?

Yes. Larger context windows let the model see more of your codebase at once, which improves suggestions for large projects. Claude 4.6, GPT-4.1, and Gemini 2.5/3.1 all support 1M tokens. Grok 4 Fast supports 2M. Llama 4 Scout claims 10M.
