Reasoning Model Evaluations

Evidence showing how reasoning models (e.g., o1) compare to non-reasoning models (e.g., GPT-4o) across different domains

This website compiles available evidence on how o1's reasoning capabilities compare to previous models. The evidence is organized by domain and includes both improvements and areas without significant progress. Each entry includes links to sources and detailed findings.

Note: There is a selection bias in the available evaluations, as researchers focus on tasks where they anticipate improvements and may be less likely to report negative results. The evidence presented here should be interpreted with this limitation in mind.

Domain | Eval Type | Models | Description | Source | Status
Coding | Internal Evaluation | o1-preview, o1-mini | Evaluation of o1's reasoning capabilities with Devin using the 'cognition-golden' benchmark | Cognition Labs | Improvement
    Cognition Labs evaluated their SWE agent Devin with an internal benchmark called 'cognition-golden'. They created this evaluation to model economically valuable tasks. More details below.
    Cognition Labs evaluation of Devin

  • Key Findings:
    • Improved reflection and analysis capabilities
    • Better at backtracking and considering different options
    • Reduced hallucination and confident incorrectness
    • Better at diagnosing root causes vs addressing symptoms
  • Evaluation harness:
    • 'cognition-golden' benchmark with realistic, economically valuable tasks
    • Tests on large codebases (millions of lines)
    • Fully reproducible environments with autonomous feedback
    • Uses simulated users for interaction testing
    • Employs agent-based evaluation with visual verification
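
Cognition's harness is not public; purely as an illustration of the properties listed above (reproducible setup, autonomous feedback), a coding task with programmatic verification might be expressed as follows. All names, commands, and the `run_agent` callback below are hypothetical.

```python
# Hypothetical sketch of an autonomously graded coding task, loosely in the spirit
# of the harness properties listed above. This is NOT Cognition's actual code;
# task names, commands, and checks are invented for illustration.
import subprocess
from dataclasses import dataclass

@dataclass
class CodingTask:
    name: str
    setup_cmd: list[str]   # prepares a reproducible environment (e.g., container build)
    agent_goal: str        # natural-language instruction given to the agent
    check_cmd: list[str]   # deterministic verification, e.g., a test suite

def run_task(task: CodingTask, run_agent) -> bool:
    """Run one task and return True if the automated check passes."""
    subprocess.run(task.setup_cmd, check=True)   # reproducible setup
    run_agent(task.agent_goal)                   # agent edits the codebase
    result = subprocess.run(task.check_cmd)      # autonomous feedback
    return result.returncode == 0

# Example (hypothetical): grade an agent on a failing-test repair task.
example = CodingTask(
    name="fix-flaky-parser-test",
    setup_cmd=["docker", "compose", "up", "-d"],
    agent_goal="Make tests/test_parser.py pass without changing the test file.",
    check_cmd=["pytest", "tests/test_parser.py", "-q"],
)
```
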
Coding | Official Evaluation | o1-preview, o1, o1-pro | OpenAI's official evaluation of o1 on coding tasks | OpenAI | Improvement
Part of the official OpenAI o1 evaluations focused on coding and programming, where o1 achieved notable improvements. In the International Olympiad in Informatics (IOI), it ranked competitively under standard contest conditions and performed even better when submission constraints were relaxed. In Codeforces evaluations, o1 surpassed prior models.

Codeforces Competition

Codeforces Elo Comparison

o1 and o1-pro results on coding, math, and qa tasks

Coding | Benchmark | o1-preview, o1-mini | Performance on USACO programming competition tasks | USACO Results | Improvement
On USACO, o1 outperformed GPT-4, increasing Pass@1 accuracy from 11.2% to 33.88%.
USACO results

USACO Pass@1 Accuracy
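
For context on how Pass@1 numbers like those above are typically computed: with n sampled solutions per problem, of which c pass the judge, the standard unbiased pass@k estimator is 1 - C(n-c, k)/C(n, k), averaged over problems. A minimal sketch follows; the per-problem counts are invented, not USACO data.

```python
# Minimal sketch of the standard unbiased pass@k estimator; the sample data
# below is invented for illustration and is not from the USACO evaluation.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that passed, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Per-problem (n, c) counts -> benchmark pass@1 is the mean over problems.
results = [(10, 3), (10, 0), (10, 10)]   # hypothetical counts
print(sum(pass_at_k(n, c, 1) for n, c in results) / len(results))
```
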

Reliability | Official Evaluation | o1-preview, o1, o1-pro | OpenAI's evaluation of model reliability | OpenAI | Improvement
Official evaluations showing improved worst-of-4 performance on math, coding, and QA tasks.
OpenAI Reliability

Worst-of-4 performance of the o1 series of models, as reported by OpenAI

o1 and o1-pro results on coding, math, and qa tasks showing 4/4 performance
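
As a concrete reading of the reliability metric above (OpenAI's exact aggregation is not spelled out here), "4/4" or "worst of 4" can be taken as the fraction of problems answered correctly on every one of four independent samples; a minimal sketch with invented data:

```python
# Hypothetical sketch of a "worst of 4" / "4-of-4" reliability metric:
# a problem counts only if all four independent samples are correct.
# The per-problem correctness grid below is invented for illustration.
samples_correct = [
    [True, True, True, True],     # counts toward 4/4
    [True, False, True, True],    # counts toward average accuracy, not 4/4
    [False, False, False, False],
]

avg_accuracy = sum(sum(row) for row in samples_correct) / (4 * len(samples_correct))
four_of_four = sum(all(row) for row in samples_correct) / len(samples_correct)
print(f"avg accuracy: {avg_accuracy:.2f}, 4/4 reliability: {four_of_four:.2f}")
```
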

QA | Official Evaluation | o1-preview, o1, o1-pro | OpenAI's evaluation of question answering capabilities | OpenAI | Improvement
o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories.
OpenAI QA

OpenAI evaluation results on QA benchmarks

o1 and o1-pro results on coding, math, and qa tasks

Mathematics | Official Evaluation | o1-preview, o1, o1-pro | OpenAI's evaluation of mathematical reasoning capabilities | OpenAI | Improvement
Official evaluations showing significant improvements in mathematical reasoning.
OpenAI Math

OpenAI's evaluation of o1's mathematical reasoning capabilities

o1 and o1-pro results on coding, math, and qa tasks

Cybersecurity | Research Paper | o1-preview, o1-mini | Evaluation by the Turing Institute of cybersecurity capabilities | Turing Institute | Improvement
The key findings for the OpenAI o1 models in this paper highlight strong performance on automated software exploitation tasks. Using DARPA's AI Cyber Challenge (AIxCC) framework, o1-preview achieved the highest success rate among the tested models, solving 64.71% of challenge project vulnerabilities (CPVs) and significantly surpassing the other models. The o1-mini variant, though more cost-efficient, showed reduced efficacy.
Turing Cybersecurity

Cybersecurity | Benchmark | o1-preview | Official results from the CyBench cybersecurity benchmark leaderboard | CyBench | No Improvement
No significant improvements shown on the CyBench benchmark. CyBench includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. This benchmark requires o1 to be used as part of an agent scaffold.
CyBench leaderboard
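
Because CyBench scores models inside an agent scaffold rather than with a single prompt, here is a generic sketch of the kind of command-execution loop such a scaffold runs. This is not the CyBench harness, and `query_model` is a hypothetical stand-in for a model API call.

```python
# Generic agent-scaffold sketch (NOT the CyBench harness): the model proposes
# shell commands, the scaffold executes them and feeds back the output, until
# the model submits a flag. query_model is a hypothetical model-API wrapper.
import subprocess

def query_model(transcript: str) -> str:
    """Hypothetical call to a reasoning model; returns its next message."""
    raise NotImplementedError

def run_ctf_task(task_description: str, max_steps: int = 20):
    transcript = f"Task: {task_description}\nReply with COMMAND: <cmd> or FLAG: <flag>."
    for _ in range(max_steps):
        reply = query_model(transcript)
        if reply.startswith("FLAG:"):
            return reply.removeprefix("FLAG:").strip()   # candidate flag to grade
        if reply.startswith("COMMAND:"):
            cmd = reply.removeprefix("COMMAND:").strip()
            out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            transcript += f"\n{reply}\nOUTPUT:\n{out.stdout}{out.stderr}"
    return None  # no flag submitted within the step budget
```
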

Planning | Research Paper | o1-preview, o1-mini | Analysis of planning and reasoning capabilities on various benchmarks | Rao et al. | Improvement
Rao Planning

o1 performance on Blocksworld and other planning benchmarks

  • Blocksworld Performance:
    • 97.8% accuracy on standard Blocksworld tasks, significantly outperforming prior LLMs
    • Performance dropped to 52.8% on obfuscated Mystery Blocksworld tasks
    • Only 23.6% success rate on larger problems requiring longer plans
  • Unsolvable Task Detection:
    • Correctly identified 27% of unsolvable instances
    • Generated incorrect plans in 54% of unsolvable scenarios
    • Shows limited reliability in identifying impossible tasks
  • Efficiency and Cost Analysis:
    • Significantly higher computational costs compared to traditional LLMs
    • Classical planners like Fast Downward remain orders of magnitude faster and cheaper
    • Trade-off between improved reasoning capabilities and computational efficiency
    Detection of unsolvable tasks

    Overall, while o1 shows substantial improvements over previous LLMs in structured reasoning, it faces challenges in scalability, efficiency, and reliability, especially on complex or obfuscated tasks.
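
To make the Blocksworld accuracy figures above concrete, a generated plan is judged correct only if every action's preconditions hold and the goal is reached. Below is a minimal, simplified validity checker for a toy instance; it is a sketch, not the paper's exact PDDL formulation.

```python
# Minimal Blocksworld plan checker, simplified for illustration. A state maps
# each block to what it rests on ("table" or another block); move(block, dest)
# requires both block and dest to be clear.
def is_clear(state: dict, x: str) -> bool:
    return x == "table" or x not in state.values()

def apply_move(state: dict, block: str, dest: str) -> dict:
    assert is_clear(state, block) and is_clear(state, dest) and block != dest
    new_state = dict(state)
    new_state[block] = dest
    return new_state

def plan_is_valid(state: dict, plan: list, goal: dict) -> bool:
    try:
        for block, dest in plan:
            state = apply_move(state, block, dest)
    except AssertionError:
        return False  # an action was applied with unmet preconditions
    return all(state[b] == on for b, on in goal.items())

# Initial: C on A; A and B on the table.  Goal: A on B on C.
init = {"A": "table", "B": "table", "C": "A"}
goal = {"A": "B", "B": "C"}
print(plan_is_valid(init, [("C", "table"), ("B", "C"), ("A", "B")], goal))  # True
```
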

Planning | Benchmark | o1-preview, o1-mini | Results from the MR-Ben meta-reasoning benchmark | Zeng et al. | Improvement
Improved performance in meta-reasoning and system-2 thinking tasks. The paper introduces MR-Ben, a meta-reasoning benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) across diverse domains, including natural sciences, coding, and logic. Unlike traditional benchmarks that focus solely on accuracy, MR-Ben assesses the reasoning process itself by requiring models to identify and analyze errors in reasoning chains.
MR-Ben

Performance of o1 on MR-Ben
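
Because MR-Ben grades the reasoning process rather than just the final answer, each item pairs a question with a candidate reasoning chain that the model must judge and, if flawed, localize the first erroneous step in. An illustrative item and scorer follow; the example content and scoring rule are hypothetical, not taken from the benchmark itself.

```python
# Illustrative MR-Ben-style meta-reasoning item; the content and exact scoring
# below are hypothetical, and the benchmark's real schema may differ.
from dataclasses import dataclass

@dataclass
class MetaReasoningItem:
    question: str
    steps: list[str]        # a candidate chain of reasoning
    first_error_step: int   # index of the first wrong step, or -1 if the chain is sound

item = MetaReasoningItem(
    question="What is 15% of 80?",
    steps=["15% = 0.15", "0.15 * 80 = 10", "So the answer is 10."],
    first_error_step=1,     # 0.15 * 80 is 12, so step index 1 is wrong
)

def score(prediction: int, item: MetaReasoningItem) -> bool:
    """The model must say whether the chain is sound and, if not, where it breaks."""
    return prediction == item.first_error_step

print(score(1, item))  # True: the error was located correctly
```
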

Legal, Medical | Research Paper | o1-preview | Results from the Japanese certification examination for 'Operations Chief of Radiography With X-rays' | Goto et al. | Improvement
Improved performance on a Japanese radiography certification examination. Overall accuracy ranged from 57.5% to 70.0% for GPT-4o and from 71.1% to 86.5% for o1-preview. GPT-4o achieved passing accuracy in all subjects except relevant laws and regulations, whereas o1-preview met the passing criteria across all four exam sets (graphical questions were excluded from scoring). o1-preview's accuracy was significantly higher than GPT-4o's both over all questions and on relevant laws and regulations (p = 0.03 in each case); no significant differences were found in the other subjects.
X-Ray Legal
Reproducibility | Benchmark | o1-mini | Computational reproducibility evaluation on CORE-Bench | CORE-Bench Leaderboard | No Improvement
No significant improvements on the CORE-Bench computational reproducibility benchmark. Claude 3.5 Sonnet clearly outperforms o1-mini, scoring 37.8% accuracy versus o1-mini's 24.4%.
General AI Assistant | Benchmark | o1-mini | GAIA benchmark evaluation | Our internal evals | No Improvement
No significant improvements shown on the GAIA benchmark with standard agent scaffolding. GAIA is a benchmark for General AI Assistants; it poses real-world questions that require fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. o1-mini gets 37% accuracy, while Claude 3.5 Sonnet gets 58% and GPT-4o gets 35%.
Writing | Official Evaluation | o1-preview | OpenAI's evaluation of personal writing and text editing capabilities | OpenAI | No Improvement
No significant improvements in personal writing and text editing tasks. OpenAI evaluated human preference for o1-preview vs. GPT-4o on challenging, open-ended prompts across a number of domains: human trainers were shown anonymized responses to a prompt from o1-preview and GPT-4o and voted for the response they preferred. o1-preview was not preferred on some natural language tasks such as writing and text editing.
Personal Writing

Personal writing and text editing tasks

Human Tasks | Research Paper | o1-preview | Tasks where human intuition typically performs better | Liu et al. | No Improvement
This paper investigates the impact of chain-of-thought (CoT) prompting on the performance of large language models (LLMs) across six task categories inspired by cognitive psychology. Key findings include:
  • Performance Decreases with CoT: CoT significantly reduced performance in tasks like implicit statistical learning, facial recognition, and classifying data with exceptions—tasks where verbal reasoning also impairs human performance. For instance, CoT decreased OpenAI o1-preview's accuracy by 36.3% in a grammar learning task.
Grammar Learning

Performance of o1-preview on grammar learning task
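
The comparison underlying these findings is between asking for an answer directly and prompting the model to reason step by step first. A minimal sketch of the two conditions follows; the prompt wording and the `query_model` helper are hypothetical, not the paper's exact setup.

```python
# Sketch of the two prompting conditions contrasted in CoT studies; the exact
# prompts used by Liu et al. may differ, and query_model is a hypothetical API wrapper.
def query_model(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a chat-completion call

def direct_answer(task: str) -> str:
    return query_model(f"{task}\nAnswer with only the final label.")

def cot_answer(task: str) -> str:
    return query_model(f"{task}\nLet's think step by step, then give the final label.")

# On implicit-statistical-learning items, the paper reports that the CoT
# condition can score substantially lower than the direct condition.
```
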

Mathematics | Analysis | o1-mini | Analysis of performance scaling and compute requirements | Scaling Laws | Analysis
They evaluate on the 30 questions that make up the 2024 American Invitational Mathematics Examination (AIME). With OpenAI's o1-mini, accuracy improves as the test-time token budget increases up to ~2^17 tokens. However, performance plateaus at ~70% accuracy beyond this point, even with self-consistency techniques like majority voting. This aligns with prior findings that such methods saturate after initial gains, emphasizing diminishing returns for extended inference budgets.
Flattening Scaling Curve

Performance of o1-mini on AIME reproduced from the official evaluations and with larger token budgets. The curve flattens past 70% accuracy.
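
The self-consistency baseline mentioned above amounts to sampling several reasoning chains and taking the most common final answer; a minimal sketch (the sampled answers are invented):

```python
# Minimal self-consistency (majority-vote) sketch for AIME-style questions,
# where the final answer is an integer from 0 to 999. Sampled answers are fake.
from collections import Counter

def majority_vote(sampled_answers: list) -> int:
    return Counter(sampled_answers).most_common(1)[0][0]

samples = [113, 113, 247, 113, 900]   # e.g., five sampled chains' final answers
print(majority_vote(samples))          # 113
# Accuracy-vs-token-budget curves are produced by sweeping the number of samples
# (or the per-sample thinking length) and re-scoring at each budget.
```
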

QA | Analysis | o1-preview | EpochAI evaluation of o1-preview vs GPT-4o | EpochAI | Analysis
EpochAI plotted GPQA accuracy against the number of output tokens generated for two inference-scaling methods applied to GPT-4o, and compared the results to o1-preview's GPQA accuracy. While both methods improved GPT-4o's accuracy, they still significantly underperformed o1-preview at inference-compute parity.

Naively scaling inference compute isn't enough. o1-preview's superior performance likely stems from advanced RL techniques and better search methods.
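
"Inference compute parity" here means matching the total output tokens spent per question across methods; a toy calculation with made-up token counts illustrates the bookkeeping (these are not EpochAI's numbers):

```python
# Toy illustration of matching inference compute by output tokens; all numbers
# below are invented, not EpochAI's measurements.
o1_tokens_per_question = 8_000    # one long reasoning trace (hypothetical)
gpt4o_tokens_per_sample = 800     # one short CoT answer (hypothetical)

# Number of GPT-4o samples (e.g., for majority voting) that spends roughly the
# same number of output tokens as a single o1-preview response:
k = o1_tokens_per_question // gpt4o_tokens_per_sample
print(k)  # 10 samples at parity; accuracies are then compared at this matched budget
```
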

Mathematics, Coding, QA | Official Evaluation | R1-Lite-Preview | DeepSeek's official evaluations of their R1 reasoning model | DeepSeek | Improvement
DeepSeek's R1-Lite-Preview outperforms GPT-4o on all tasks related to coding, math, and QA. It is also competitive with o1-preview.
DeepSeek R1-Lite-Preview

DeepSeek's R1-Lite-Preview outperforms GPT-4o on all tasks.

Inference scaling laws of DeepSeek-R1-Lite-Preview: longer reasoning, better performance.
DeepSeek R1-Lite-Preview

DeepSeek's R1-Lite-Preview shows steady score improvements on AIME as thought length increases.

Safety | Official Evaluation | o1-preview | OpenAI's official safety evaluations | OpenAI | Improvement
OpenAI conducted a set of safety tests and red-teaming before deployment, in accordance with their Preparedness Framework. They found that chain-of-thought reasoning contributed to capability improvements across their evaluations.
o1-preview safety

o1-preview safety evaluations

Toxicity | Research Paper | Other | Evaluation of bias and toxicity in zero-shot reasoning with CoT | Shaikh et al. | No Improvement
The authors conducted experiments on various datasets to evaluate the bias and toxicity of zero-shot reasoning with CoT. They found that zero-shot CoT reasoning significantly increased the likelihood of generating toxic or biased answers.
toxicity

Results of how CoT reasoning affects toxicity in zero-shot reasoning

Status:
Improvement: Evidence shows significant improvements over non-reasoning models
No Improvement: Evidence shows no improvements over non-reasoning models
Analysis: Other relevant evidence or analysis

Authors