New large language models are released in quick succession, and their benchmark results are scattered across blog posts and technical reports from many different organizations. With so many benchmarks and no unified summary, it is difficult to get a clear, concise comparison of model capabilities. The table below collects the reported scores; a slash (/) means no score was found for that model on that benchmark.
Model | SimpleQA | HumanEval | MATH | MGSM | DROP | MMLU | GPQA | GPQA Diamond | MMMU | AIME 2024 | Aider Polyglot | Humanity’s Last Exam | AIME 2025 | SWE-bench Verified | MATH 500 | LiveCodeBench V5 | Vibe-Eval (Reka) | MMMLU | IFEval | TAU-bench | MRCR | Global MMLU (Lite) | Video-MME (Overall) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude 3 Opus | 23.5 | 84.9 | 60.1 | 90.7 | 83.1 | 86.8 | 50.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Claude 3.5 Sonnet | 28.9 | 92.0 | 71.1 | 91.6 | 87.1 | 88.3 | 59.4 | 65.0 | 70.4 | 16.0 | / | / | / | 49.0 | 78.0 | / | / | 82.1 | 90.2 | 48.8 | / | / | / |
Claude 3.7 Sonnet | / | / | / | / | / | / | / | 78.2 | 75.0 | 61.3 | 64.9 | 8.9 | 49.5 | 62.3 | 96.2 | / | / | 86.1 | 93.2 | 81.2 | / | / | / |
DeepSeek R1 | 30.1 | / | / | / | / | / | / | 71.5 | / | 79.8 | 56.9 | 8.6 | 70.0 | 49.2 | 97.3 | 64.3 | / | / | 83.3 | / | / | / | / |
GPT-4.1 | 41.6 | 94.5 | 82.1 | 86.9 | 79.4 | 90.2 | 66.3 | 66.3 | 75.0 | / | 52.9 | 5.4 | / | 54.6 | / | / | / | / | / | / | / | / | / |
GPT-4.1-mini | 16.8 | 93.8 | 81.4 | 88.2 | 81.0 | 87.5 | 65.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4.1-nano | 7.6 | 87.0 | 62.3 | 73.0 | 82.2 | 80.1 | 50.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4.5-preview | 62.5 | 88.6 | 87.1 | 86.9 | 83.4 | 90.8 | 69.5 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4o | 38.8 | 90.2 | 68.5 | 90.3 | 81.5 | 85.7 | 46.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4o-mini | 9.5 | 87.2 | 70.2 | 87.0 | 79.7 | 82.0 | 40.2 | 81.4 | 81.6 | 93.4 | 58.2 | 14.3 | 92.7 | / | / | / | / | / | / | / | / | / | / |
Gemini 1.0 Ultra | / | 74.4 | 53.2 | 79.0 | 82.4 | 83.7 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 1.5 Flash | / | 71.5 | 40.9 | 75.5 | 78.4 | 77.9 | 38.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 1.5 Pro | / | 71.9 | 58.5 | 88.7 | 78.9 | 81.9 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 2.0 Flash | 29.9 | / | / | / | / | / | / | 60.1 | 71.7 | 32.0 | 22.2 | 5.1 | 27.5 | / | / | 34.5 | 56.4 | / | / | / | 74.2 | 83.4 | / |
Gemini 2.5 Flash | 29.7 | / | / | / | / | / | / | 78.3 | 76.7 | 88.0 | 44.2 | 12.1 | 78.0 | / | / | 63.5 | 62.0 | / | / | / | 84.6 | 88.4 | / |
Gemini 2.5 Pro | 50.8 | / | / | / | / | / | / | 83.0 | 79.6 | / | 76.5 | 17.8 | 83.0 | 63.2 | / | 75.6 | 65.6 | / | / | / | 93.0 | 88.6 | 84.8 |
Grok 2 | / | 88.4 | 76.1 | / | / | 87.5 | 56.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Grok 2 mini | / | 85.7 | 73.0 | / | / | 86.2 | 51.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Grok 3 Beta | 43.6 | / | / | / | / | / | / | 84.6 | 78.0 | 93.3 | 53.3 | / | 93.3 | / | / | 79.4 | / | / | / | / | / | / | / |
Llama 3.1 | / | 72.6 | 51.9 | 68.9 | 59.5 | 68.4 | 30.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
gpt-4 | / | 86.6 | 64.5 | 85.1 | 81.5 | 85.4 | 41.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
gpt-4-turbo | 24.2 | 88.2 | 73.4 | 89.6 | 86.0 | 86.7 | 49.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o1 | 42.6 | / | 96.4 | 89.3 | 90.2 | 91.8 | 75.7 | 78.0 | 78.2 | 83.3 | / | / | / | 48.9 | 96.4 | / | / | 87.7 | / | 54.2 | / | / | / |
o1-mini | 7.6 | 92.4 | 90.0 | 89.9 | 83.9 | 85.2 | 60.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o1-preview | 42.4 | 92.4 | 85.5 | 90.8 | 74.8 | 90.8 | 73.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3 | 49.4 | 87.4 | 97.8 | 92.3 | 80.6 | 92.9 | 82.8 | 83.3 | 82.9 | / | 79.6 | 20.3 | 88.9 | 69.1 | / | / | / | / | / | / | / | / | / |
o3-high | 48.6 | 88.4 | 98.1 | 92.0 | 89.8 | 93.3 | 83.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-low | 49.4 | 87.3 | 96.9 | 91.9 | 82.3 | 92.8 | 78.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-mini | 13.4 | 96.3 | 97.3 | 90.8 | 79.2 | 85.9 | 74.9 | 79.7 | / | 87.3 | / | / | / | 49.3 | 97.9 | / | / | 79.5 | / | / | / | / | / |
o3-mini-high | 13.8 | 97.6 | 97.9 | 92.0 | 80.6 | 86.9 | 77.2 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-mini-low | 13.0 | 94.5 | 95.8 | 89.4 | 77.6 | 84.9 | 67.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini | 20.2 | 97.3 | 97.5 | 93.7 | 77.7 | 90.0 | 77.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini-high | 19.3 | 99.3 | 98.2 | 93.5 | 78.1 | 90.3 | 81.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini-low | 20.2 | 95.9 | 96.2 | 93.0 | 76.0 | 89.5 | 73.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
This website maintains a comprehensive collection of large language model benchmark results, gathering the published scores for available models and presenting them in a single table. Beyond the table, the data can be downloaded as a CSV file, and the complete original records are available in JSON. Each JSON record includes the model name, benchmark name, testing method, and source link, so every number remains traceable to its original report.
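For readers who want to work with the downloads directly, here is a minimal Python sketch of how the two formats could be consumed. The file names (`llm_benchmark.json`, `llm_benchmark.csv`), the JSON key names (`model`, `benchmark`, `score`, `source`), and the CSV column names are assumptions based on the field list above, not the actual schema; adjust them to match the files you download.

```python
import csv
import json

# NOTE: file names and field names below are assumptions based on the fields
# described above (model name, benchmark name, testing method, source link);
# adjust them to match the actual downloads.

# JSON release: one record per model/benchmark pair, with a source link that
# keeps every number traceable to its original report.
with open("llm_benchmark.json", encoding="utf-8") as f:
    records = json.load(f)

# Collect every reported GPQA Diamond score, highest first.
gpqa_diamond = sorted(
    (
        (r["model"], float(r["score"]), r["source"])
        for r in records
        if r["benchmark"] == "GPQA Diamond" and r["score"] is not None
    ),
    key=lambda row: row[1],
    reverse=True,
)
for model, score, source in gpqa_diamond:
    print(f"{model:<20} {score:5.1f}  {source}")

# CSV export: assumed to mirror the table above, one row per model and one
# column per benchmark, so a single model's scores can be pulled out directly.
with open("llm_benchmark.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        if row["Model"] == "Gemini 2.5 Pro":
            print({k: v for k, v in row.items() if v not in ("", "/")})
```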
The collection will be updated on an ongoing basis. Reader feedback is welcome, and new results will be incorporated as soon as possible.
Citation Format
Cheng Xuanda. llm benchmark [Dataset]. Laptop Review, 19 May 2025, https://laptopreview.club/introducing-llm-benchmark-dataset/
```bibtex
@dataset{ChengLLMDataset,
  author = {Cheng Xuanda},
  title  = {llm benchmark},
  year   = {2025},
  url    = {https://laptopreview.club/introducing-llm-benchmark-dataset/},
  note   = {Dataset, Laptop Review, 2025-05-19}
}
```
Update Log
May 19, 2025
First release. Covers more than twenty benchmarks, including SimpleQA, HumanEval, GPQA, GPQA Diamond, and MMLU, and more than 30 models from OpenAI, Anthropic, Google, xAI, DeepSeek, Meta, and others.