New large language models are released constantly, and their evaluation results are scattered across each vendor's blog posts and technical reports. With so many benchmarks in circulation, there has been no unified summary that lets users compare model capabilities at a glance.
This site provides that summary: an aggregated dataset of LLM benchmark results, collecting published evaluation numbers for as many models as possible and presenting them in a single table. A "/" marks a cell for which no published result has been collected.
Model | SimpleQA | HumanEval | MATH | MGSM | DROP | MMLU | GPQA | GPQA Diamond | MMMU | AIME 2024 | Aider Polyglot | Humanity’s Last Exam | AIME 2025 | SWE-bench Verified | MATH 500 | LiveCodeBench V5 | Vibe-Eval (Reka) | MMMLU | IFEval | TAU-bench | MRCR | Global MMLU (Lite) | Video-MME (Overall) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude 3 Opus | 23.5 | 84.9 | 60.1 | 90.7 | 83.1 | 86.8 | 50.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Claude 3.5 Sonnet | 28.9 | 92.0 | 71.1 | 91.6 | 87.1 | 88.3 | 59.4 | 65.0 | 70.4 | 16.0 | / | / | / | 49.0 | 78.0 | / | / | 82.1 | 90.2 | 48.8 | / | / | / |
Claude 3.7 Sonnet | / | / | / | / | / | / | / | 78.2 | 75.0 | 61.3 | 64.9 | 8.9 | 49.5 | 62.3 | 96.2 | / | / | 86.1 | 93.2 | 81.2 | / | / | / |
DeepSeek R1 | 30.1 | / | / | / | / | / | / | 71.5 | / | 79.8 | 56.9 | 8.6 | 70.0 | 49.2 | 97.3 | 64.3 | / | / | 83.3 | / | / | / | / |
GPT-4.1 | 41.6 | 94.5 | 82.1 | 86.9 | 79.4 | 90.2 | 66.3 | 66.3 | 75.0 | / | 52.9 | 5.4 | / | 54.6 | / | / | / | / | / | / | / | / | / |
GPT-4.1-mini | 16.8 | 93.8 | 81.4 | 88.2 | 81.0 | 87.5 | 65.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4.1-nano | 7.6 | 87.0 | 62.3 | 73.0 | 82.2 | 80.1 | 50.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4.5-preview | 62.5 | 88.6 | 87.1 | 86.9 | 83.4 | 90.8 | 69.5 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4o | 38.8 | 90.2 | 68.5 | 90.3 | 81.5 | 85.7 | 46.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4o-mini | 9.5 | 87.2 | 70.2 | 87.0 | 79.7 | 82.0 | 40.2 | 81.4 | 81.6 | 93.4 | 58.2 | 14.3 | 92.7 | / | / | / | / | / | / | / | / | / | / |
Gemini 1.0 Ultra | / | 74.4 | 53.2 | 79.0 | 82.4 | 83.7 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 1.5 Flash | / | 71.5 | 40.9 | 75.5 | 78.4 | 77.9 | 38.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 1.5 Pro | / | 71.9 | 58.5 | 88.7 | 78.9 | 81.9 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 2.0 Flash | 29.9 | / | / | / | / | / | / | 60.1 | 71.7 | 32.0 | 22.2 | 5.1 | 27.5 | / | / | 34.5 | 56.4 | / | / | / | 74.2 | 83.4 | / |
Gemini 2.5 Flash | 29.7 | / | / | / | / | / | / | 78.3 | 76.7 | 88.0 | 44.2 | 12.1 | 78.0 | / | / | 63.5 | 62.0 | / | / | / | 84.6 | 88.4 | / |
Gemini 2.5 Pro | 50.8 | / | / | / | / | / | / | 83.0 | 79.6 | / | 76.5 | 17.8 | 83.0 | 63.2 | / | 75.6 | 65.6 | / | / | / | 93.0 | 88.6 | 84.8 |
Grok 2 | / | 88.4 | 76.1 | / | / | 87.5 | 56.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Grok 2 mini | / | 85.7 | 73.0 | / | / | 86.2 | 51.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Grok 3 Beta | 43.6 | / | / | / | / | / | / | 84.6 | 78.0 | 93.3 | 53.3 | / | 93.3 | / | / | 79.4 | / | / | / | / | / | / | / |
Llama 3.1 | / | 72.6 | 51.9 | 68.9 | 59.5 | 68.4 | 30.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
gpt-4 | / | 86.6 | 64.5 | 85.1 | 81.5 | 85.4 | 41.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
gpt-4-turbo | 24.2 | 88.2 | 73.4 | 89.6 | 86.0 | 86.7 | 49.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o1 | 42.6 | / | 96.4 | 89.3 | 90.2 | 91.8 | 75.7 | 78.0 | 78.2 | 83.3 | / | / | / | 48.9 | 96.4 | / | / | 87.7 | / | 54.2 | / | / | / |
o1-mini | 7.6 | 92.4 | 90.0 | 89.9 | 83.9 | 85.2 | 60.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o1-preview | 42.4 | 92.4 | 85.5 | 90.8 | 74.8 | 90.8 | 73.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3 | 49.4 | 87.4 | 97.8 | 92.3 | 80.6 | 92.9 | 82.8 | 83.3 | 82.9 | / | 79.6 | 20.3 | 88.9 | 69.1 | / | / | / | / | / | / | / | / | / |
o3-high | 48.6 | 88.4 | 98.1 | 92.0 | 89.8 | 93.3 | 83.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-low | 49.4 | 87.3 | 96.9 | 91.9 | 82.3 | 92.8 | 78.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-mini | 13.4 | 96.3 | 97.3 | 90.8 | 79.2 | 85.9 | 74.9 | 79.7 | / | 87.3 | / | / | / | 49.3 | 97.9 | / | / | 79.5 | / | / | / | / | / |
o3-mini-high | 13.8 | 97.6 | 97.9 | 92.0 | 80.6 | 86.9 | 77.2 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-mini-low | 13.0 | 94.5 | 95.8 | 89.4 | 77.6 | 84.9 | 67.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini | 20.2 | 97.3 | 97.5 | 93.7 | 77.7 | 90.0 | 77.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini-high | 19.3 | 99.3 | 98.2 | 93.5 | 78.1 | 90.3 | 81.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini-low | 20.2 | 95.9 | 96.2 | 93.0 | 76.0 | 89.5 | 73.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Beyond the table view, the results can be downloaded in CSV format, and the complete raw data is available as JSON. Each JSON record includes the model name, benchmark name, evaluation method, and source link, so every number can be traced back to its origin.
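To make the traceability concrete, here is a minimal sketch of how the JSON export could be consumed. The filename `llm-benchmark.json` and the field names `model`, `benchmark`, `method`, `score`, and `source` are assumptions for illustration only; check the downloaded file for the actual keys.

```python
import json

# A minimal sketch. The filename and all field names below are hypothetical;
# the published JSON documents model name, benchmark name, evaluation method,
# and source link, but the exact keys may differ from those assumed here.
with open("llm-benchmark.json", encoding="utf-8") as f:
    records = json.load(f)

# Print one model's scores together with the source link for each number.
for rec in records:
    if rec["model"] == "GPT-4.1":
        print(f'{rec["benchmark"]}: {rec["score"]} ({rec["source"]})')
```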
This collection is updated on an ongoing basis. Reader feedback is welcome; new results will be incorporated as quickly as possible.
Citation
Cheng Xuanda. llm benchmark [Dataset]. Laptop Review, 19 May 2025, https://laptopreview.club/llm-benchmark/
@dataset{ChengLLMDataset,
  author = {Cheng Xuanda},
  title  = {llm benchmark},
  year   = {2025},
  url    = {https://laptopreview.club/llm-benchmark/},
  note   = {Dataset, Laptop Review, 2025-05-19}
}
Changelog
May 19, 2025
Initial release. Covers over twenty benchmarks, including SimpleQA, HumanEval, GPQA, GPQA Diamond, and MMLU, and more than 30 models from OpenAI, Anthropic, Google, xAI, DeepSeek, and Meta.