The Complete Online LLM Rankings — Every Model, Every Price, What Works Right Now
There are over 50 significant LLMs available online today across a dozen providers. Most are fine at everything and great at nothing. The ones that actually excel at specific tasks are the ones worth knowing about.
This page tracks which models lead in each category, how those rankings shift over time, exactly what they cost, and exactly how we test them. The data updates automatically every day based on the latest Chatbot Arena results, benchmark scores, and real-world performance tests.
How These Rankings Work
Our Testing Method (In Plain English)
We score each model on a 100-point scale across three areas:
- Accuracy (40 points): Does it get the answer right? We run standardized tests for each field — coding problems, reasoning puzzles, creative writing, multimodal understanding.
- Reliability (35 points): Is it consistent? We test each model 50 times on similar tasks and measure how often it produces quality output without errors or hallucinations.
- Usability (25 points): Is it practical? We factor in speed, cost, ease of access, and how well the model explains its reasoning.
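The rubric above amounts to a weighted sum. The sketch below is a minimal illustration, not the scoring code behind this page. It assumes each area is first reduced to a fraction between 0 and 1 and then scaled linearly to its point budget, and that reliability is simply the share of the 50 runs judged acceptable. The names MAX_POINTS, reliability_fraction, and composite_score are hypothetical.

```python
# Maximum points per area, taken from the rubric above.
MAX_POINTS = {"accuracy": 40, "reliability": 35, "usability": 25}


def reliability_fraction(outcomes: list[bool]) -> float:
    """Share of repeated runs that produced acceptable output."""
    if not outcomes:
        raise ValueError("need at least one run")
    return sum(outcomes) / len(outcomes)


def composite_score(fractions: dict[str, float]) -> float:
    """Scale each area's 0-1 fraction to its point budget and sum to 0-100.

    Linear scaling is an assumption made for illustration.
    """
    total = 0.0
    for area, max_points in MAX_POINTS.items():
        fraction = fractions[area]
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{area} fraction must be between 0 and 1")
        total += fraction * max_points
    return total


# Made-up example: 45 of 50 runs acceptable, 80% accuracy, 60% usability.
runs = [True] * 45 + [False] * 5
example = composite_score(
    {
        "accuracy": 0.8,
        "reliability": reliability_fraction(runs),
        "usability": 0.6,
    }
)
print(round(example, 1))
```

The example inputs are invented for the demonstration and do not describe any model on this page.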
The Complete Model Directory
Every model currently available online, ranked by Text Arena Elo score, with pricing and key specs.
Current Rankings by Category
Where to Start
You do not need the top model in every category. You need the right tool for the work you actually do.
For most people: Start with Claude Sonnet 4.7 as your general assistant ($3/$15). It’s 40% cheaper than Opus, ranks in the top 3 across most categories, and handles everyday tasks exceptionally well.
For developers: Use Claude Opus 4.7 for hard coding problems, but consider Qwen 3.7 Max ($2.50/$7.50) as your volume default. It's a third of the price and ranks #4 in Code Arena.
For enterprises: Gemini 2.5 Pro ($2/$12) offers the best price-to-performance ratio for general tasks, with a 1M token context window.
For cost-sensitive teams: DeepSeek V4 Flash ($0.14/$0.28) is the cheapest production-ready model. Qwen 3.7 Plus ($0.40/$1.60) adds vision for barely more.
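To compare the recommendations above for your own workload, a rough cost estimate helps. The sketch below assumes the price pairs quoted on this page are US dollars per million input tokens and per million output tokens, which is the common convention but is not stated explicitly here. The PRICES table and estimate_cost function are hypothetical helpers, and the token volumes are made up.

```python
# (input price, output price), copied from the recommendations above.
# Assumed unit: US dollars per one million tokens.
PRICES = {
    "Claude Sonnet 4.7": (3.00, 15.00),
    "Qwen 3.7 Max": (2.50, 7.50),
    "Gemini 2.5 Pro": (2.00, 12.00),
    "DeepSeek V4 Flash": (0.14, 0.28),
    "Qwen 3.7 Plus": (0.40, 1.60),
}

TOKENS_PER_PRICE_UNIT = 1_000_000


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated spend in dollars for a given token volume."""
    input_price, output_price = PRICES[model]
    return (
        input_tokens * input_price + output_tokens * output_price
    ) / TOKENS_PER_PRICE_UNIT


# Made-up monthly workload: 200M input tokens, 40M output tokens.
for name in sorted(PRICES, key=lambda m: estimate_cost(m, 200_000_000, 40_000_000)):
    print(f"{name}: ${estimate_cost(name, 200_000_000, 40_000_000):,.2f}")
```

Because output tokens usually cost several times more than input tokens, the ranking can change with the input-to-output ratio of your workload, so it is worth rerunning the estimate with your own numbers.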
Methodology Notes
Rankings are derived from:
- Chatbot Arena (LMSYS) — 6M+ blind pairwise human votes across text, code, and vision tasks
- SWE-bench Verified — Real GitHub issue resolution
- MMLU-Pro — Professional knowledge assessment
- Humanity’s Last Exam — Expert-level reasoning
- LiveCodeBench — Real-world coding problems
- Terminal-Bench — Long-horizon terminal-based coding
- ARC-AGI-2 — Fluid intelligence benchmark
These rankings update automatically every 24 hours based on the latest Arena data and benchmark releases.
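For readers wondering what a gap between two Arena scores means in practice, the standard Elo formula converts a rating difference into an expected head-to-head win rate. The sketch below uses that textbook formula with the conventional 400-point scale. It is an assumption that the Arena scores shown on this page follow this scale exactly, since the leaderboard's own fitting procedure may differ, and the function name and example ratings are hypothetical.

```python
def expected_win_probability(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expectation that model A is preferred over model B.

    Uses the conventional base-10, 400-point scale. Whether the Arena
    scores on this page follow that scale exactly is an assumption.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Made-up ratings: a 50-point lead is a modest edge, not a blowout.
print(f"{expected_win_probability(1450, 1400):.3f}")
```

Under this formula, equal ratings give a 50% expected win rate, and a 400-point lead gives roughly 91%, so small gaps near the top of the table indicate models that voters prefer almost equally often.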

