
Comprehensive LLM Comparison

Experimental

Compare 32 major LLMs, including Claude Opus 4.5, Gemini 3 Pro, GPT-5.2, GPT-5.1, Grok 4.1, Kimi K2, Llama 4, Mistral, DeepSeek, Qwen3, and GLM 4.6, with pricing that ranges from $0.05 to $75 per million tokens.

Updated December 19, 2025 with GPT-5.2 Pro, Thinking, Instant & Codex, plus Claude Opus 4.5, Gemini 3 Pro, GPT-5.1 Instant & Thinking, and latest models from Moonshot AI, Meta, Anthropic, Google, Mistral, DeepSeek, Alibaba, and Zhipu AI

| Model | Provider | Tier | Context Window | Max Output | Input ($/M) | Output ($/M) | Cached Input ($/M) | Latency |
|---|---|---|---|---|---|---|---|---|
| Gemini 3 Pro | Google | Premium | 1M | 64K | $2.00* | $12.00* | N/A | Fast |
| o3-pro | OpenAI | Premium | 200K | 100K | $15.00 | $60.00 | N/A | Slow (reasoning) |
| Kimi K2 Thinking | Moonshot AI | Premium | 256K | 32K | $0.80 | $3.20 | N/A | Medium |
| Claude Opus 4.5 | Anthropic | Premium | 1M | 8K | $5.00 | $25.00 | $0.50 | Medium |
| Claude Sonnet 4.5 | Anthropic | Mid-tier | 1M | 8K | $3.00 | $15.00 | $0.30 | Fast |
| Claude Opus 4.1 | Anthropic | Premium | 1M | 8K | $15.00 | $75.00 | N/A | Medium |
| GPT-5.2 Pro | OpenAI | Premium | 400K | 128K | $21.00 | $168.00 | $2.10 | Medium |
| GPT-5.2 Thinking | OpenAI | Premium | 400K | 128K | $1.75 | $14.00 | $0.175 | Medium (reasoning) |
| GPT-5.2 Instant | OpenAI | Premium | 400K | 128K | $1.75 | $14.00 | $0.175 | Fast |
| GPT-5.2 Codex | OpenAI | Premium | 400K | 128K | $1.75 | $14.00 | $0.175 | Adaptive |
| GPT-5.1 Instant | OpenAI | Premium | 272K | 128K | $1.25 | $10.00 | $0.125 | Fast |
| GPT-5.1 Thinking | OpenAI | Premium | 272K | 128K | $1.25 | $10.00 | $0.125 | Adaptive |
| GPT-5 | OpenAI | Premium | 272K | 128K | $1.25 | $10.00 | $0.125 | Fast |
| Gemini 2.5 Pro | Google | Premium | 1M | 8K | $1.25* | $10.00* | $0.31 | Fast |
| Grok 4.1 | xAI | Premium | 2M | 32K | $0.20 | $0.50 | $0.05 | Fast |
| Grok 4 | xAI | Premium | 128K | 32K | $3.00 | $15.00 | $0.75 | Fast |
| o4-mini | OpenAI | Mid-tier | 200K | 100K | $1.10 | $4.40 | N/A | Medium (reasoning) |
| Claude Haiku 4.5 | Anthropic | Mid-tier | 200K | 8K | $1.00 | $5.00 | $0.10 | Very Fast |
| Grok 3 | xAI | Mid-tier | 128K | 32K | $3.00 | $15.00 | N/A | Fast |
| DeepSeek R1 | DeepSeek | Cost-effective | 128K | 32K | $0.55 | $2.19 | $0.14 | Medium (reasoning) |
| Mistral Medium 3 | Mistral | Cost-effective | 128K | 32K | $0.40 | $2.00 | N/A | Fast |
| Gemini 2.0 Flash | Google | Cost-effective | 1M | 8K | $0.10 | $0.40 | N/A | Very Fast |
| DeepSeek V3.1 | DeepSeek | Cost-effective | 128K | 32K | $0.56 | $1.68 | $0.07 | Fast |
| GPT-5 Mini | OpenAI | Cost-effective | 272K | 64K | $0.25 | $2.00 | $0.025 | Fast |
| Llama 4 Maverick | Meta | Cost-effective | 1M | 32K | $0.50 | $0.77 | N/A | Fast |
| Qwen3-235B | Alibaba | Cost-effective | 128K | 32K | $0.35 | $0.60 | N/A | Fast |
| Qwen3-32B | Alibaba | Cost-effective | 128K | 32K | $0.20 | $0.40 | N/A | Very Fast |
| GLM 4.6 | Zhipu AI | Cost-effective | 128K | 32K | $0.30 | $0.55 | N/A | Fast |
| GPT-4o | OpenAI | Premium | 128K | 16K | $2.50 | $10.00 | N/A | Fast |
| Llama 4 Scout | Meta | Ultra-cheap | 10M | 32K | $0.11 | $0.34 | N/A | Fast |
| Mistral Small 3 | Mistral | Cost-effective | 128K | 32K | $0.20 | $0.60 | N/A | Very Fast |
| GPT-5 Nano | OpenAI | Ultra-cheap | 272K | 32K | $0.05 | $0.40 | $0.005 | Very Fast |

*Tiered pricing (see Notes). Context window and max output are in tokens; costs are per million tokens.
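The per-million-token prices above translate into per-request costs with simple arithmetic. The sketch below is illustrative only: the `PRICES` dictionary and `request_cost` helper are made up for this example, with prices copied from the table (verify against each provider's current pricing page before relying on them).

```python
# Illustrative cost math for per-million-token list prices.
# Prices copied from the comparison table above; not an official API.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "claude-opus-4.5": (5.00, 25.00),
    "gpt-5-nano": (0.05, 0.40),
    "gemini-3-pro": (2.00, 12.00),  # tiered pricing; base tier shown
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at list prices."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

# A 10K-token prompt with a 1K-token reply:
opus = request_cost("claude-opus-4.5", 10_000, 1_000)  # 0.05 + 0.025 = $0.075
nano = request_cost("gpt-5-nano", 10_000, 1_000)       # 0.0005 + 0.0004 = $0.0009
print(f"Opus 4.5: ${opus:.4f}  GPT-5 Nano: ${nano:.4f}")
```

At these list prices the same request costs roughly 80x more on the premium pick than on the cheapest one, which is why the tier column matters for high-volume workloads.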
Key Strengths
  • Gemini 3 Pro: #1 on WebDev Arena (1487 ELO); 76.2% SWE-bench Verified; 64K output; Gemini Agent
  • o3-pro: Most capable reasoning; math/science; PhD-level tasks
  • Kimi K2 Thinking: Beats GPT-5; 71.3% SWE-bench; 200-300 tool calls autonomy
  • Claude Opus 4.5: Tops SWE-bench Verified; 10.6% better than Sonnet 4.5; best for agents and coding; most robustly aligned
  • Claude Sonnet 4.5: Best coding/agents; 72.7% SWE-bench; 90% cache savings
  • Claude Opus 4.1: Best coding (72.5% SWE-bench); long-running tasks; agent workflows
  • GPT-5.2 Pro: Human expert level; GDPval SOTA; professional knowledge work
  • GPT-5.2 Thinking: 400K context; enhanced reasoning; 5x GPT-4 context; long-context tasks; 90% cache discount
  • GPT-5.2 Instant: 400K context; speed optimized; everyday work; same pricing as Thinking
  • GPT-5.2 Codex: 56.4% SWE-Bench Pro; agentic coding; cybersecurity; context compaction; Windows optimization
  • GPT-5.1 Instant: Adaptive reasoning; warmer and more conversational; more accurate
  • GPT-5.1 Thinking: Advanced reasoning; faster on simple tasks; more persistent on complex ones
  • GPT-5: Software-on-demand; multimodal; 88.4% GPQA
  • Gemini 2.5 Pro: #1 on LMArena; 86.4 GPQA reasoning; Deep Think mode
  • Grok 4.1: 2M context window; 3x lower hallucination rate; top EQ-Bench3 scores; real-time web search
  • Grok 4: Real-time web search; trained on Colossus; multimodal
  • o4-mini: Fast reasoning; best on AIME 2024/2025; math/coding
  • Claude Haiku 4.5: Fastest; near-Sonnet quality; 90% cache savings
  • Grok 3: Real-time search; function calling; fast inference
  • DeepSeek R1: Reasoning model; 27x cheaper than o1; MIT license
  • Mistral Medium 3: 8x cheaper than competitors; EU hosting; function calling
  • Gemini 2.0 Flash: 1M context; native tool use; multimodal
  • DeepSeek V3.1: Hybrid thinking/non-thinking; 671B params; 82.6% HumanEval
  • GPT-5 Mini: GPT-5 quality; 272K context; multimodal
  • Llama 4 Maverick: 400B-param MoE; multimodal; open weights
  • Qwen3-235B: 235B params (22B active); hybrid thinking; Apache 2.0
  • Qwen3-32B: Outperforms o1-mini; strong reasoning; Apache 2.0
  • GLM 4.6: 355B MoE; bilingual (CN/EN); MIT license
  • GPT-4o: Flagship general-purpose; multimodal; versatile
  • Llama 4 Scout: 10M context; multimodal; open weights
  • Mistral Small 3: 24B params; fast inference; EU compliant
  • GPT-5 Nano: High throughput; simple tasks; 272K context
Notes
  • Gemini 3 Pro: Tiered pricing. Released Nov 18, 2025. Google's most intelligent model, with agentic capabilities.
  • o3-pro: Released June 2025. Highest reasoning capability.
  • Kimi K2 Thinking: Released Nov 6, 2025. 1T-param open-source model, trained for only $4.6M.
  • Claude Opus 4.5: Released Nov 24, 2025. Best-in-class for coding, agents, and autonomous tasks.
  • Claude Sonnet 4.5: Released Sep 2025. Anthropic's best coding model.
  • Claude Opus 4.1: Released Aug 2025. Works continuously for hours on complex tasks.
  • GPT-5.2 Pro: Released Dec 11, 2025. Most capable for professional tasks across 44 occupations.
  • GPT-5.2 Thinking: Released Dec 11, 2025. Scores 40% higher than GPT-5.1; advanced reasoning with 400K context.
  • GPT-5.2 Instant: Released Dec 11, 2025. Fast workhorse for info-seeking, technical writing, and translation.
  • GPT-5.2 Codex: Released Dec 18, 2025. Advanced agentic coding for professional software engineering and defensive cybersecurity.
  • GPT-5.1 Instant: Released Nov 12, 2025. Most-used model, with adaptive reasoning capability.
  • GPT-5.1 Thinking: Released Nov 12, 2025. Advanced reasoning model, easier to understand.
  • GPT-5: Released Aug 2025. Legacy model, replaced by GPT-5.1.
  • Gemini 2.5 Pro: Tiered pricing. Released March 2025. Google's most expensive model.
  • Grok 4.1: Released Nov 2025. 3x less likely to hallucinate; #1 in LMArena Text Arena (Thinking variant); 2M-token context with consistent performance.
  • Grok 4: Released Aug 2025. Knowledge cutoff: Nov 2024.
  • o4-mini: Released April 2025. Replaces o3-mini.
  • Claude Haiku 4.5: Released Oct 2025. Within 5% of Sonnet at 1/3 the cost.
  • Grok 3: API launched April 2025. Compatible with the OpenAI SDK.
  • DeepSeek R1: Released Jan 2025. Comparable to o1 at 3.6% of the cost.
  • Mistral Medium 3: Released Jan 2025. Best value for production workloads.
  • Gemini 2.0 Flash: Next-gen features with superior speed.
  • DeepSeek V3.1: Released Aug 2025. MIT license. Beats GPT-4 at 17% of the cost.
  • GPT-5 Mini: Released Aug 2025. Smaller, faster GPT-5 variant.
  • Llama 4 Maverick: Released April 2025. 17B active params, 128 experts.
  • Qwen3-235B: MoE architecture, open source under Apache 2.0.
  • Qwen3-32B: Latest Qwen model; excellent performance-to-cost ratio.
  • GLM 4.6: Strong Chinese/English performance; open source.
  • GPT-4o: Best for the majority of tasks across industries.
  • Llama 4 Scout: Released April 2025. 10M tokens ≈ 7,500 pages. 17B active, 109B total params.
  • Mistral Small 3: Released Jan 2025. Compact and efficient.
  • GPT-5 Nano: Released Aug 2025. Excels at classification and simple instructions.
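Several models above advertise roughly 90% cache savings, and the Cached Input column shows why: cached tokens are billed at a deep discount. The sketch below is illustrative; `effective_input_price` is a hypothetical helper, the $3.00/$0.30 prices are Claude Sonnet 4.5's from the table, and the 80% cache-hit rate is an assumed workload figure, not a quoted one.

```python
# Sketch of how the "Cached Input" column changes the blended input price.
# Prices from the table (Claude Sonnet 4.5: $3.00 fresh, $0.30 cached);
# the cache-hit rate is workload-dependent and assumed here.

def effective_input_price(fresh: float, cached: float, hit_rate: float) -> float:
    """Blended $/M input tokens when a fraction of tokens hit the prompt cache."""
    return hit_rate * cached + (1.0 - hit_rate) * fresh

price = effective_input_price(3.00, 0.30, hit_rate=0.8)
print(f"${price:.2f}/M input tokens")  # $0.84/M, a 72% reduction from $3.00
```

The same arithmetic applies to any model with a cached-input price; workloads that resend a large fixed system prompt benefit the most.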

Quick Picks by Use Case

Best for Coding

Claude Opus 4.5 ($5/$25) 🏆

Tops SWE-bench Verified. 10.6% better than Sonnet. Best for agents & autonomous tasks.

Gemini 3 Pro ($2/$12)

#1 on WebDev Arena. 76.2% SWE-bench Verified. 64K output with Gemini Agent.

GPT-5.1 Instant ($1.25/$10)

OpenAI's most-used model, with adaptive reasoning. More accurate and conversational than GPT-5.

Claude Sonnet 4.5 ($3/$15)

1M context for entire codebases. Best for agents and computer use.

Best for Long Context

Llama 4 Scout ($0.15/$0.50) 🏆

10M tokens! Process 7,500 pages. Open weights. Multimodal.

Claude Sonnet 4.5 ($3/$15)

1M context with best reasoning quality and prompt caching.

Gemini 2.5 Pro/Flash ($1.25/$10)

1M context with thinking mode; Flash is the cost-effective option.
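The Llama 4 Scout note's "10M tokens ≈ 7,500 pages" heuristic gives a quick way to size any of the context windows above. The helper below is hypothetical, and the 8K-token reserve for the model's reply is an assumption for illustration.

```python
# Rough capacity check using the heuristic that 10M tokens ≈ 7,500 pages
# (about 1,333 tokens per page). Integer math keeps the result exact.

def pages_that_fit(context_tokens: int, reserve_for_output: int = 8_000) -> int:
    """Approximate pages that fit in a context window, leaving room for the reply."""
    return (context_tokens - reserve_for_output) * 7_500 // 10_000_000

print(pages_that_fit(1_000_000))   # a 1M window holds roughly 744 pages
print(pages_that_fit(10_000_000))  # Llama 4 Scout's 10M window: about 7,494 pages
```

Real token-per-page counts vary with formatting and tokenizer, so treat these as order-of-magnitude estimates.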

Cheapest Options

GPT-5 Nano ($0.05/$0.40)

Cheapest with 272K context. GPT-5 family efficiency.

Mistral Small 3 ($0.10/$0.30)

3x faster than Llama 3.3. Very low latency.

Llama 4 Scout ($0.15/$0.50)

10M context at ultra-low price. Open weights.

GPT-4o mini ($0.15/$0.60)

Multimodal, production-ready. Widely supported.

Open Source Leaders

Llama 4 Scout

10M context, 109B params. Meta's flagship.

Llama 4 Maverick

400B total, 17B active. Multimodal MoE.

DeepSeek V3/R1

671B params. Ultra-cheap reasoning and chat.

Qwen 2.5 72B

Apache 2.0. 29 languages. 72.7B params.
