New large language models are released in quick succession, and their benchmark results are scattered across blog posts and technical reports from many different organizations. With so many benchmarks and no unified summary, it is difficult to get a clear, concise comparison of model capabilities. The table below collects the reported scores; a slash (/) means no score was found for that model on that benchmark.
Model | SimpleQA | HumanEval | MATH | MGSM | DROP | MMLU | GPQA | GPQA Diamond | MMMU | AIME 2024 | Aider Polyglot | Humanity’s Last Exam | AIME 2025 | SWE-bench Verified | MATH 500 | LiveCodeBench V5 | Vibe-Eval (Reka) | MMMLU | IFEval | TAU-bench | MRCR | Global MMLU (Lite) | Video-MME (Overall) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude 3 Opus | 23.5 | 84.9 | 60.1 | 90.7 | 83.1 | 86.8 | 50.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Claude 3.5 Sonnet | 28.9 | 92.0 | 71.1 | 91.6 | 87.1 | 88.3 | 59.4 | 65.0 | 70.4 | 16.0 | / | / | / | 49.0 | 78.0 | / | / | 82.1 | 90.2 | 48.8 | / | / | / |
Claude 3.7 Sonnet | / | / | / | / | / | / | / | 78.2 | 75.0 | 61.3 | 64.9 | 8.9 | 49.5 | 62.3 | 96.2 | / | / | 86.1 | 93.2 | 81.2 | / | / | / |
DeepSeek R1 | 30.1 | / | / | / | / | / | / | 71.5 | / | 79.8 | 56.9 | 8.6 | 70.0 | 49.2 | 97.3 | 64.3 | / | / | 83.3 | / | / | / | / |
GPT-4.1 | 41.6 | 94.5 | 82.1 | 86.9 | 79.4 | 90.2 | 66.3 | 66.3 | 75.0 | / | 52.9 | 5.4 | / | 54.6 | / | / | / | / | / | / | / | / | / |
GPT-4.1-mini | 16.8 | 93.8 | 81.4 | 88.2 | 81.0 | 87.5 | 65.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4.1-nano | 7.6 | 87.0 | 62.3 | 73.0 | 82.2 | 80.1 | 50.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4.5-preview | 62.5 | 88.6 | 87.1 | 86.9 | 83.4 | 90.8 | 69.5 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4o | 38.8 | 90.2 | 68.5 | 90.3 | 81.5 | 85.7 | 46.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
GPT-4o-mini | 9.5 | 87.2 | 70.2 | 87.0 | 79.7 | 82.0 | 40.2 | 81.4 | 81.6 | 93.4 | 58.2 | 14.3 | 92.7 | / | / | / | / | / | / | / | / | / | / |
Gemini 1.0 Ultra | / | 74.4 | 53.2 | 79.0 | 82.4 | 83.7 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 1.5 Flash | / | 71.5 | 40.9 | 75.5 | 78.4 | 77.9 | 38.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 1.5 Pro | / | 71.9 | 58.5 | 88.7 | 78.9 | 81.9 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Gemini 2.0 Flash | 29.9 | / | / | / | / | / | / | 60.1 | 71.7 | 32.0 | 22.2 | 5.1 | 27.5 | / | / | 34.5 | 56.4 | / | / | / | 74.2 | 83.4 | / |
Gemini 2.5 Flash | 29.7 | / | / | / | / | / | / | 78.3 | 76.7 | 88.0 | 44.2 | 12.1 | 78.0 | / | / | 63.5 | 62.0 | / | / | / | 84.6 | 88.4 | / |
Gemini 2.5 Pro | 50.8 | / | / | / | / | / | / | 83.0 | 79.6 | / | 76.5 | 17.8 | 83.0 | 63.2 | / | 75.6 | 65.6 | / | / | / | 93.0 | 88.6 | 84.8 |
Grok 2 | / | 88.4 | 76.1 | / | / | 87.5 | 56.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Grok 2 mini | / | 85.7 | 73.0 | / | / | 86.2 | 51.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
Grok 3 Beta | 43.6 | / | / | / | / | / | / | 84.6 | 78.0 | 93.3 | 53.3 | / | 93.3 | / | / | 79.4 | / | / | / | / | / | / | / |
Llama 3.1 | / | 72.6 | 51.9 | 68.9 | 59.5 | 68.4 | 30.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
gpt-4 | / | 86.6 | 64.5 | 85.1 | 81.5 | 85.4 | 41.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
gpt-4-turbo | 24.2 | 88.2 | 73.4 | 89.6 | 86.0 | 86.7 | 49.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o1 | 42.6 | / | 96.4 | 89.3 | 90.2 | 91.8 | 75.7 | 78.0 | 78.2 | 83.3 | / | / | / | 48.9 | 96.4 | / | / | 87.7 | / | 54.2 | / | / | / |
o1-mini | 7.6 | 92.4 | 90.0 | 89.9 | 83.9 | 85.2 | 60.0 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o1-preview | 42.4 | 92.4 | 85.5 | 90.8 | 74.8 | 90.8 | 73.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3 | 49.4 | 87.4 | 97.8 | 92.3 | 80.6 | 92.9 | 82.8 | 83.3 | 82.9 | / | 79.6 | 20.3 | 88.9 | 69.1 | / | / | / | / | / | / | / | / | / |
o3-high | 48.6 | 88.4 | 98.1 | 92.0 | 89.8 | 93.3 | 83.4 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-low | 49.4 | 87.3 | 96.9 | 91.9 | 82.3 | 92.8 | 78.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-mini | 13.4 | 96.3 | 97.3 | 90.8 | 79.2 | 85.9 | 74.9 | 79.7 | / | 87.3 | / | / | / | 49.3 | 97.9 | / | / | 79.5 | / | / | / | / | / |
o3-mini-high | 13.8 | 97.6 | 97.9 | 92.0 | 80.6 | 86.9 | 77.2 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o3-mini-low | 13.0 | 94.5 | 95.8 | 89.4 | 77.6 | 84.9 | 67.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini | 20.2 | 97.3 | 97.5 | 93.7 | 77.7 | 90.0 | 77.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini-high | 19.3 | 99.3 | 98.2 | 93.5 | 78.1 | 90.3 | 81.3 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
o4-mini-low | 20.2 | 95.9 | 96.2 | 93.0 | 76.0 | 89.5 | 73.6 | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / | / |
This website maintains a comprehensive collection of large language model benchmark results, gathering the published scores for available models and presenting them in a single table. Beyond the table, the data can be downloaded as a CSV file, and the complete original records are available in JSON. Each JSON record includes the model name, benchmark name, testing method, and source link, so every number remains traceable to its original report.
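For readers who want to work with the downloads directly, here is a minimal Python sketch of how the two formats could be consumed. The file names (`llm_benchmark.json`, `llm_benchmark.csv`), the JSON key names (`model`, `benchmark`, `score`, `source`), and the CSV column names are assumptions based on the field list above, not the actual schema; adjust them to match the files you download.

```python
import csv
import json

# NOTE: file names and field names below are assumptions based on the fields
# described above (model name, benchmark name, testing method, source link);
# adjust them to match the actual downloads.

# JSON release: one record per model/benchmark pair, with a source link that
# keeps every number traceable to its original report.
with open("llm_benchmark.json", encoding="utf-8") as f:
    records = json.load(f)

# Collect every reported GPQA Diamond score, highest first.
gpqa_diamond = sorted(
    (
        (r["model"], float(r["score"]), r["source"])
        for r in records
        if r["benchmark"] == "GPQA Diamond" and r["score"] is not None
    ),
    key=lambda row: row[1],
    reverse=True,
)
for model, score, source in gpqa_diamond:
    print(f"{model:<20} {score:5.1f}  {source}")

# CSV export: assumed to mirror the table above, one row per model and one
# column per benchmark, so a single model's scores can be pulled out directly.
with open("llm_benchmark.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        if row["Model"] == "Gemini 2.5 Pro":
            print({k: v for k, v in row.items() if v not in ("", "/")})
```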
The collection will be updated on an ongoing basis. Reader feedback is welcome, and new results will be incorporated as soon as possible.
Citation Format
Cheng Xuanda. llm benchmark [Dataset]. Laptop Review, 19 May 2025, https://laptopreview.club/introducing-llm-benchmark-dataset/
```bibtex
@dataset{ChengLLMDataset,
  author = {Cheng Xuanda},
  title  = {llm benchmark},
  year   = {2025},
  url    = {https://laptopreview.club/introducing-llm-benchmark-dataset/},
  note   = {Dataset, Laptop Review, 2025-05-19}
}
```
Update Log
May 19, 2025
First release. Covers more than twenty benchmarks, including SimpleQA, HumanEval, GPQA, GPQA Diamond, and MMLU, and more than 30 models from OpenAI, Anthropic, Google, xAI, DeepSeek, Meta, and others.