
Comprehensive LLM Comparison


Compare 25 major LLMs, including GPT-5.1, Kimi K2, Llama 4, Mistral, Claude, Gemini, DeepSeek, Qwen3, and GLM 4.6, with pricing from $0.05 to $75 per million tokens.

Updated November 12, 2025, with GPT-5.1 Instant & Thinking, plus the latest models from Moonshot AI, Meta, Anthropic, Google, Mistral, DeepSeek, Alibaba, and Zhipu AI.

All models at a glance (prices in USD per 1M tokens)

Each model is also rated for multimodal input, streaming, function calling, and prompt caching support (shown as checkmarks in the interactive table).

o3-pro (OpenAI)
Tier: Premium | Context: 200K tokens | Max output: 100K tokens | Input: $15.00 | Output: $60.00 | Cached input: N/A | Latency: Slow (reasoning)
Strengths: Most capable reasoning • Math/science • PhD-level tasks
Notes: Released June 2025. Highest reasoning capability

Kimi K2 Thinking (Moonshot AI)
Tier: Premium | Context: 256K tokens | Max output: 32K tokens | Input: $0.80 | Output: $3.20 | Cached input: N/A | Latency: Medium
Strengths: Beats GPT-5 • 71.3% SWE-bench • 200-300 tool calls autonomy
Notes: Released Nov 6, 2025. 1T params open-source, trained for only $4.6M

Claude Sonnet 4.5 (Anthropic)
Tier: Mid-tier | Context: 1M tokens | Max output: 8K tokens | Input: $3.00 | Output: $15.00 | Cached input: $0.30 | Latency: Fast
Strengths: Best coding/agents • 72.7% SWE-bench • 90% cache savings
Notes: Released Sep 2025. Anthropic's best coding model

Claude Opus 4.1 (Anthropic)
Tier: Premium | Context: 1M tokens | Max output: 8K tokens | Input: $15.00 | Output: $75.00 | Cached input: N/A | Latency: Medium
Strengths: Best coding (72.5% SWE-bench) • Long-running tasks • Agent workflows
Notes: Released Aug 2025. Works continuously for hours on complex tasks

GPT-5.1 Instant (OpenAI)
Tier: Premium | Context: 272K tokens | Max output: 128K tokens | Input: $1.25 | Output: $10.00 | Cached input: $0.125 | Latency: Fast
Strengths: Adaptive reasoning • Warmer & conversational • More accurate
Notes: Released Nov 12, 2025. Most-used model with adaptive reasoning capability

GPT-5.1 Thinking (OpenAI)
Tier: Premium | Context: 272K tokens | Max output: 128K tokens | Input: $1.25 | Output: $10.00 | Cached input: $0.125 | Latency: Adaptive
Strengths: Advanced reasoning • Faster on simple tasks • More persistent on complex tasks
Notes: Released Nov 12, 2025. Advanced reasoning model, easier to understand

GPT-5 (OpenAI)
Tier: Premium | Context: 272K tokens | Max output: 128K tokens | Input: $1.25 | Output: $10.00 | Cached input: $0.125 | Latency: Fast
Strengths: Software-on-demand • Multimodal • 88.4% GPQA
Notes: Released Aug 2025. Legacy model, replaced by GPT-5.1

Gemini 2.5 Pro (Google)
Tier: Premium | Context: 1M tokens | Max output: 8K tokens | Input: $1.25* | Output: $10.00* | Cached input: $0.31 | Latency: Fast
Strengths: #1 on LMArena • 86.4 GPQA reasoning • Deep Think mode
Notes: *Tiered pricing. Released March 2025. Google's most expensive model

o4-mini (OpenAI)
Tier: Mid-tier | Context: 200K tokens | Max output: 100K tokens | Input: $1.10 | Output: $4.40 | Cached input: N/A | Latency: Medium (reasoning)
Strengths: Fast reasoning • Best on AIME 2024/2025 • Math/coding
Notes: Released April 2025. Replaces o3-mini

Claude Haiku 4.5 (Anthropic)
Tier: Mid-tier | Context: 200K tokens | Max output: 8K tokens | Input: $1.00 | Output: $5.00 | Cached input: $0.10 | Latency: Very Fast
Strengths: Fastest • Near Sonnet quality • 90% cache savings
Notes: Released Oct 2025. Within 5% of Sonnet at 1/3 the cost

Grok 3 (xAI)
Tier: Mid-tier | Context: 128K tokens | Max output: 32K tokens | Input: $3.00 | Output: $15.00 | Cached input: N/A | Latency: Fast
Strengths: Real-time search • Function calling • Fast inference
Notes: API launched April 2025. Compatible with the OpenAI SDK

DeepSeek R1 (DeepSeek)
Tier: Cost-effective | Context: 128K tokens | Max output: 32K tokens | Input: $0.55 | Output: $2.19 | Cached input: $0.14 | Latency: Medium (reasoning)
Strengths: Reasoning model • 27x cheaper than o1 • MIT license
Notes: Released Jan 2025. Comparable to o1 at 3.6% of the cost

Mistral Medium 3 (Mistral)
Tier: Cost-effective | Context: 128K tokens | Max output: 32K tokens | Input: $0.40 | Output: $2.00 | Cached input: N/A | Latency: Fast
Strengths: 8x cheaper than competitors • EU hosting • Function calling
Notes: Released Jan 2025. Best value for production workloads

Gemini 2.0 Flash (Google)
Tier: Cost-effective | Context: 1M tokens | Max output: 8K tokens | Input: $0.10 | Output: $0.40 | Cached input: N/A | Latency: Very Fast
Strengths: 1M context • Native tool use • Multimodal
Notes: Next-gen features with superior speed

DeepSeek V3.1 (DeepSeek)
Tier: Cost-effective | Context: 128K tokens | Max output: 32K tokens | Input: $0.56 | Output: $1.68 | Cached input: $0.07 | Latency: Fast
Strengths: Hybrid thinking/non-thinking • 671B params • 82.6% HumanEval
Notes: Released Aug 2025. MIT license. Beats GPT-4 at 17% of the cost

GPT-5 Mini (OpenAI)
Tier: Cost-effective | Context: 272K tokens | Max output: 64K tokens | Input: $0.25 | Output: $2.00 | Cached input: $0.025 | Latency: Fast
Strengths: GPT-5 quality • 272K context • Multimodal
Notes: Released Aug 2025. Smaller, faster GPT-5 variant

Llama 4 Maverick (Meta)
Tier: Cost-effective | Context: 1M tokens | Max output: 32K tokens | Input: $0.50 | Output: $0.77 | Cached input: N/A | Latency: Fast
Strengths: 400B params MoE • Multimodal • Open weights
Notes: Released April 2025. 17B active params, 128 experts

Qwen3-235B (Alibaba)
Tier: Cost-effective | Context: 128K tokens | Max output: 32K tokens | Input: $0.35 | Output: $0.60 | Cached input: N/A | Latency: Fast
Strengths: 235B params (22B active) • Hybrid thinking • Apache 2.0
Notes: MoE architecture, open-source under Apache 2.0

Qwen3-32B (Alibaba)
Tier: Cost-effective | Context: 128K tokens | Max output: 32K tokens | Input: $0.20 | Output: $0.40 | Cached input: N/A | Latency: Very Fast
Strengths: Outperforms o1-mini • Strong reasoning • Apache 2.0
Notes: Latest Qwen model, excellent performance-to-cost ratio

GLM 4.6 (Zhipu AI)
Tier: Cost-effective | Context: 128K tokens | Max output: 32K tokens | Input: $0.30 | Output: $0.55 | Cached input: N/A | Latency: Fast
Strengths: 355B MoE • Bilingual (CN/EN) • MIT license
Notes: Strong Chinese/English performance, open-source

GPT-4o (OpenAI)
Tier: Premium | Context: 128K tokens | Max output: 16K tokens | Input: $2.50 | Output: $10.00 | Cached input: N/A | Latency: Fast
Strengths: Flagship general-purpose • Multimodal • Versatile
Notes: Best for the majority of tasks across industries

Llama 4 Scout (Meta)
Tier: Ultra-cheap | Context: 10M tokens | Max output: 32K tokens | Input: $0.11 | Output: $0.34 | Cached input: N/A | Latency: Fast
Strengths: 10M context • Multimodal • Open weights
Notes: Released April 2025. 10M tokens is roughly 7,500 pages. 17B active params, 109B total

Mistral Small 3 (Mistral)
Tier: Cost-effective | Context: 128K tokens | Max output: 32K tokens | Input: $0.20 | Output: $0.60 | Cached input: N/A | Latency: Very Fast
Strengths: 24B params • Fast inference • EU compliant
Notes: Released Jan 2025. Compact and efficient

GPT-5 Nano (OpenAI)
Tier: Ultra-cheap | Context: 272K tokens | Max output: 32K tokens | Input: $0.05 | Output: $0.40 | Cached input: $0.005 | Latency: Very Fast
Strengths: High throughput • Simple tasks • 272K context
Notes: Released Aug 2025. Excels at classification and simple instructions
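
Because every model above is priced per million tokens, a request's cost is just a weighted sum over its uncached input, cached input, and output tokens. Below is a minimal sketch of that arithmetic in plain Python (the helper name and defaults are illustrative, not any provider's SDK), using Claude Haiku 4.5's listed rates as the example.

```python
# Rough per-request cost from the $/1M-token rates listed in the table above.
# The rates come from the table; the function itself is only an illustration.

def estimate_cost_usd(input_tokens, output_tokens, cached_tokens,
                      input_rate, output_rate, cached_rate=None):
    """All rates are USD per 1M tokens; cached tokens bill at the cached rate if one is listed."""
    if cached_rate is None:
        cached_rate = input_rate  # no cache discount listed ("N/A" in the table)
    uncached = max(input_tokens - cached_tokens, 0)
    return (uncached * input_rate
            + cached_tokens * cached_rate
            + output_tokens * output_rate) / 1_000_000

# Example: Claude Haiku 4.5 ($1.00 input / $5.00 output / $0.10 cached input per 1M tokens)
# with a 20K-token prompt, 15K of it served from cache, and a 2K-token reply.
print(f"${estimate_cost_usd(20_000, 2_000, 15_000, 1.00, 5.00, 0.10):.4f}")
# -> $0.0165: $0.005 for 5K uncached input, $0.0015 for 15K cached input, $0.01 for 2K output
```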

Quick Picks by Use Case

Best for Coding

GPT-5.1 Instant ($1.25/$10)

Latest from OpenAI with adaptive reasoning. More accurate and conversational than GPT-5.

Claude Sonnet 4.5 ($3/$15)

1M context for entire codebases. Best for agents and computer use.

Claude Haiku 4.5 ($1/$5)

Fast coding at 1/3 the cost of Sonnet. Great for agent tasks.

Best for Long Context

Llama 4 Scout ($0.15/$0.50) 🏆

10M tokens! Process 7,500 pages. Open weights. Multimodal.

Claude Sonnet 4.5 ($3/$15)

1M context with best reasoning quality and prompt caching.

Gemini 2.5 Pro/Flash ($1.25/$10)

1M context with thinking mode. Flash is the cost-effective option.
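
Before paying for long context, it helps to estimate how many tokens your material actually is. The sketch below is a rough fit-check only: the ~4-characters-per-token heuristic and the 8K reserved for output are assumptions, not provider guidance, and exact counts require each provider's tokenizer. Context window sizes are taken from the table above.

```python
# Rough check of whether a corpus fits a model's context window.

WINDOWS = {                      # context windows in tokens, per the comparison table
    "Llama 4 Scout": 10_000_000,
    "Claude Sonnet 4.5": 1_000_000,
    "Gemini 2.5 Pro": 1_000_000,
    "GPT-5.1 Instant": 272_000,
}

def approx_tokens(text: str) -> int:
    return len(text) // 4        # ~4 characters per token heuristic, not a real tokenizer

def fits(text: str, model: str, reserve_for_output: int = 8_000) -> bool:
    return approx_tokens(text) + reserve_for_output <= WINDOWS[model]

corpus = "x" * 2_000_000          # stand-in for ~2M characters (~500K tokens) of source files
for name in WINDOWS:
    print(f"{name}: {'fits' if fits(corpus, name) else 'too large'}")
```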

Cheapest Options

GPT-5 nano ($0.05/$0.40)

Cheapest with 272K context. GPT-5 family efficiency.

Mistral Small 3 ($0.10/$0.30)

3x faster than Llama 3.3. Very low latency.

Llama 4 Scout ($0.15/$0.50)

10M context at ultra-low price. Open weights.

GPT-4o mini ($0.15/$0.60)

Multimodal, production-ready. Widely supported.
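
Which of these is cheapest for you depends on your input/output mix, since input and output tokens are billed at different rates. A quick sketch using the prices quoted above and a hypothetical workload (the request volume and token counts are made up for illustration):

```python
# Compare the "cheapest options" above on a hypothetical workload:
# 1,000 requests/day, 2K input + 500 output tokens each, over a 30-day month.

PRICES = {                      # (input $/1M tokens, output $/1M tokens), as quoted above
    "GPT-5 Nano": (0.05, 0.40),
    "Mistral Small 3": (0.10, 0.30),
    "Llama 4 Scout": (0.15, 0.50),
    "GPT-4o mini": (0.15, 0.60),
}

requests = 1_000 * 30
in_tokens, out_tokens = 2_000, 500

for model, (p_in, p_out) in PRICES.items():
    monthly = requests * (in_tokens * p_in + out_tokens * p_out) / 1_000_000
    print(f"{model:16s} ~${monthly:.2f}/month")
# At this mix: GPT-5 Nano ~$9.00, Mistral Small 3 ~$10.50, Llama 4 Scout ~$16.50, GPT-4o mini ~$18.00
```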

Open Source Leaders

Llama 4 Scout

10M context, 109B params. Meta's flagship.

Llama 4 Maverick

400B total, 17B active. Multimodal MoE.

DeepSeek V3/R1

671B params. Ultra-cheap reasoning and chat.

Qwen 2.5 72B

Apache 2.0. 29 languages. 72.7B params.

Need Help Choosing?
Try our interactive use case matcher to find the best model for your specific needs