New large language models appear one after another, and their benchmark results are scattered across blog posts and technical reports from different organizations. With so many benchmarks and no unified summary, users lack a clear, concise comparison of model capabilities.

| Model | SimpleQA | HumanEval | MATH | MGSM | DROP | MMLU | GPQA | GPQA Diamond | MMMU | AIME 2024 | Aider Polyglot | Humanity's Last Exam | AIME 2025 | SWE-bench Verified | MATH 500 | LiveCodeBench V5 | Vibe-Eval (Reka) | MMMLU | IFEval | TAU-bench | MRCR | Global MMLU (Lite) | Video-MME (Overall) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3 Opus | 23.5 | 84.9 | 60.1 | 90.7 | 83.1 | 86.8 | 50.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Claude 3.5 Sonnet | 28.9 | 92.0 | 71.1 | 91.6 | 87.1 | 88.3 | 59.4 | 65.0 | 70.4 | 16.0 | / | / | / | 49.0 | 78.0 | / | / | 82.1 | 90.2 | 48.8 | / | / | / |
| Claude 3.7 Sonnet | / | / | / | / | / | / | / | 78.2 | 75.0 | 61.3 | 64.9 | 8.9 | 49.5 | 62.3 | 96.2 | / | / | 86.1 | 93.2 | 81.2 | / | / | / |
| DeepSeek R1 | 30.1 | / | / | / | / | / | / | 71.5 | / | 79.8 | 56.9 | 8.6 | 70.0 | 49.2 | 97.3 | 64.3 | / | / | 83.3 | / | / | / | / |
| GPT-4.1 | 41.6 | 94.5 | 82.1 | 86.9 | 79.4 | 90.2 | 66.3 | 66.3 | 75.0 | / | 52.9 | 5.4 | / | 54.6 | / | / | / | / | / | / | / | / | / |
| GPT-4.1-mini | 16.8 | 93.8 | 81.4 | 88.2 | 81.0 | 87.5 | 65.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| GPT-4.1-nano | 7.6 | 87.0 | 62.3 | 73.0 | 82.2 | 80.1 | 50.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| GPT-4.5-preview | 62.5 | 88.6 | 87.1 | 86.9 | 83.4 | 90.8 | 69.5 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| GPT-4o | 38.8 | 90.2 | 68.5 | 90.3 | 81.5 | 85.7 | 46.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| GPT-4o-mini | 9.5 | 87.2 | 70.2 | 87.0 | 79.7 | 82.0 | 40.2 | 81.4 | 81.6 | 93.4 | 58.2 | 14.3 | 92.7 | / | / | / | / | / | / | / | / | / | / |
| Gemini 1.0 Ultra | / | 74.4 | 53.2 | 79.0 | 82.4 | 83.7 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Gemini 1.5 Flash | / | 71.5 | 40.9 | 75.5 | 78.4 | 77.9 | 38.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Gemini 1.5 Pro | / | 71.9 | 58.5 | 88.7 | 78.9 | 81.9 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Gemini 2.0 Flash | 29.9 | / | / | / | / | / | / | 60.1 | 71.7 | 32.0 | 22.2 | 5.1 | 27.5 | / | / | 34.5 | 56.4 | / | / | / | 74.2 | 83.4 | / |
| Gemini 2.5 Flash | 29.7 | / | / | / | / | / | / | 78.3 | 76.7 | 88.0 | 44.2 | 12.1 | 78.0 | / | / | 63.5 | 62.0 | / | / | / | 84.6 | 88.4 | / |
| Gemini 2.5 Pro | 50.8 | / | / | / | / | / | / | 83.0 | 79.6 | / | 76.5 | 17.8 | 83.0 | 63.2 | / | 75.6 | 65.6 | / | / | / | 93.0 | 88.6 | 84.8 |
| Grok 2 | / | 88.4 | 76.1 | / | / | 87.5 | 56.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Grok 2 mini | / | 85.7 | 73.0 | / | / | 86.2 | 51.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| Grok 3 Beta | 43.6 | / | / | / | / | / | / | 84.6 | 78.0 | 93.3 | 53.3 | / | 93.3 | / | / | 79.4 | / | / | / | / | / | / | / |
| Llama 3.1 | / | 72.6 | 51.9 | 68.9 | 59.5 | 68.4 | 30.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| gpt-4 | / | 86.6 | 64.5 | 85.1 | 81.5 | 85.4 | 41.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| gpt-4-turbo | 24.2 | 88.2 | 73.4 | 89.6 | 86.0 | 86.7 | 49.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o1 | 42.6 | / | 96.4 | 89.3 | 90.2 | 91.8 | 75.7 | 78.0 | 78.2 | 83.3 | / | / | / | 48.9 | 96.4 | / | / | 87.7 | / | 54.2 | / | / | / |
| o1-mini | 7.6 | 92.4 | 90.0 | 89.9 | 83.9 | 85.2 | 60.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o1-preview | 42.4 | 92.4 | 85.5 | 90.8 | 74.8 | 90.8 | 73.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o3 | 49.4 | 87.4 | 97.8 | 92.3 | 80.6 | 92.9 | 82.8 | 83.3 | 82.9 | / | 79.6 | 20.3 | 88.9 | 69.1 | / | / | / | / | / | / | / | / | / |
| o3-high | 48.6 | 88.4 | 98.1 | 92.0 | 89.8 | 93.3 | 83.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o3-low | 49.4 | 87.3 | 96.9 | 91.9 | 82.3 | 92.8 | 78.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o3-mini | 13.4 | 96.3 | 97.3 | 90.8 | 79.2 | 85.9 | 74.9 | 79.7 | / | 87.3 | / | / | / | 49.3 | 97.9 | / | / | 79.5 | / | / | / | / | / |
| o3-mini-high | 13.8 | 97.6 | 97.9 | 92.0 | 80.6 | 86.9 | 77.2 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o3-mini-low | 13.0 | 94.5 | 95.8 | 89.4 | 77.6 | 84.9 | 67.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o4-mini | 20.2 | 97.3 | 97.5 | 93.7 | 77.7 | 90.0 | 77.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o4-mini-high | 19.3 | 99.3 | 98.2 | 93.5 | 78.1 | 90.3 | 81.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
| o4-mini-low | 20.2 | 95.9 | 96.2 | 93.0 | 76.0 | 89.5 | 73.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |

This website collects large language model benchmark results in one place, gathering published scores for available models and presenting them in a single table. In addition to the table, the data can be downloaded as CSV, and the complete original records are available as JSON. Each JSON record includes the model name, benchmark name, testing method, and source link, so every score is traceable back to where it was reported.
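Since the JSON schema is only described in prose above, the snippet below is a minimal sketch of how such a download might be consumed. The file contents, key names (`model`, `benchmark`, `score`, `method`, `source`), and the example values are illustrative assumptions, not the dataset's documented field names.

```python
import json

# Hypothetical example record: key names, the "method" string, and the source
# URL are assumptions for illustration; only the kinds of fields (model name,
# benchmark name, testing method, source link) are described above.
example = """
[
  {"model": "Claude 3.5 Sonnet", "benchmark": "GPQA Diamond", "score": 65.0,
   "method": "0-shot CoT", "source": "https://example.com/claude-3-5-sonnet-report"}
]
"""

records = json.loads(example)

# Pivot the flat per-score records into a model-by-benchmark table,
# mirroring the layout of the table shown earlier.
table = {}
for r in records:
    table.setdefault(r["model"], {})[r["benchmark"]] = r["score"]

for model, scores in table.items():
    print(model, scores)
```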

This benchmark collection will be updated on an ongoing basis. Feedback from readers is welcome, and new data will be incorporated as soon as possible.

Citation Format

Cheng Xuanda. llm benchmark [Dataset]. Laptop Review, 19 May 2025, https://laptopreview.club/introducing-llm-benchmark-dataset/

@dataset{ChengLLMDataset,
  author = {Cheng Xuanda},
  title = {llm benchmark},
  year = {2025},
  url = {https://laptopreview.club/introducing-llm-benchmark-dataset/},
  note = {Data set, Laptop Review, 2025-05-19}
}

Update Log

May 19, 2025

First release. Covers 23 benchmarks, including SimpleQA, HumanEval, GPQA, GPQA Diamond, and MMLU. Includes 34 models from OpenAI, Anthropic, Google, xAI, DeepSeek, Meta, and others.
