Abstract

Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. To support rigorous evaluation of mathematical reasoning in LLMs, we introduce the “MathOdyssey” dataset, a curated collection of 387 expert-generated mathematical problems spanning high school, university, and Olympiad-level topics. Each problem is accompanied by a detailed solution and categorized by difficulty level, subject area, and answer type. The dataset was developed through a multi-stage process involving contributions from subject experts, peer review, and standardized formatting. We provide detailed metadata and a standardized schema to facilitate consistent use in downstream applications. To demonstrate the dataset’s utility, we evaluate several representative LLMs and report their performance across problem types. We release MathOdyssey as an open-access resource to enable reproducible and fine-grained assessment of mathematical capabilities in LLMs and to foster further research in mathematical reasoning and education.
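
To make the schema described above concrete, the short Python sketch below shows one hypothetical way a problem record could be represented and loaded. The field names (problem, solution, answer, level, subject, answer_type) and the JSON-lines file layout are illustrative assumptions, not the dataset's actual release format.

    import json

    # Hypothetical record layout for a MathOdyssey problem; the actual field
    # names and file format of the released dataset may differ.
    example_record = {
        "problem": "Find all real x such that x^2 - 5x + 6 = 0.",
        "solution": "Factor as (x - 2)(x - 3) = 0, so x = 2 or x = 3.",
        "answer": "x = 2 or x = 3",
        "level": "High School",
        "subject": "Algebra",
        "answer_type": "Open-ended",
    }

    def load_problems_by_level(path):
        """Read newline-delimited JSON records and group them by difficulty level."""
        by_level = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                by_level.setdefault(record["level"], []).append(record)
        return by_level

    if __name__ == "__main__":
        # Print the illustrative record to show the assumed metadata fields.
        print(json.dumps(example_record, indent=2))

Grouping records by the difficulty-level field in this way would support the kind of fine-grained, per-category evaluation the abstract describes.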

Full text

© The Author(s) 2025. This work is published under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).