
Abstract

Foundational models have demonstrated exceptional performance on established academic benchmarks, often narrowing the gap between human reasoning and artificial intelligence. These models have been seamlessly integrated into our daily lives, enabling complex reasoning across domains such as personalized education, legal analysis, and scientific discovery. While their success is widely attributed to scale, both in architectural parameters and in the volume of pretraining data, the critical role of pretraining data in shaping their capabilities and limitations is often acknowledged but rarely studied. This is largely because pretraining datasets are massive and unstructured, which makes their impact difficult to analyze systematically. However, if we cannot disentangle model behavior from pretraining data, how can we trust these systems in real-world, high-stakes applications?

In this thesis, we argue that understanding the true performance of foundational models requires going beyond conventional benchmark testing. In particular, incorporating insights from their pretraining data is essential for comprehensively evaluating and interpreting the models’ capabilities and limitations.

We focus on large language models (LLMs) and show that while they often excel in benchmark settings, they can fail on basic, seemingly trivial reasoning tasks, raising concerns about their true robustness. To better understand these limitations, we examine the relationship between a model's successes and failures through the lens of its pretraining data. We present methodologies for studying how pretraining data shapes a model's reasoning performance and introduce Snoopy, a tool that facilitates such studies by analyzing the impact of term frequencies on model performance across various tasks. The final part of this thesis evaluates recent popular multimodal models on chart reasoning, leveraging the understanding gained from the earlier analyses to probe their abilities on this task. Our findings reveal the limitations of foundational models, particularly their tendency to excel on benchmarks while struggling with fundamental reasoning tasks. By examining how pretraining data shapes model behavior, we emphasize the need for deeper, more granular evaluations to better interpret model performance and capabilities.
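To illustrate the kind of pretraining-data analysis described above, the following is a minimal sketch, not the thesis's actual methodology or the Snoopy implementation, of correlating how often a probed term appears in a toy pretraining corpus with a model's accuracy on instances involving that term. All corpus documents, terms, and accuracy values here are hypothetical.

```python
from collections import Counter
from statistics import correlation  # Pearson correlation (Python 3.10+)

# Hypothetical pretraining corpus; a real analysis would use a web-scale corpus.
pretraining_corpus = [
    "the capital of france is paris",
    "paris is a city in france",
    "addition of two numbers follows the commutative property",
]

# Hypothetical per-instance evaluation records: the key term being probed and
# the model's observed accuracy on instances involving that term.
eval_records = [
    {"term": "paris", "accuracy": 0.90},
    {"term": "commutative", "accuracy": 0.40},
    {"term": "paris", "accuracy": 0.80},
    {"term": "france", "accuracy": 0.85},
]

# Count how often each term appears in the pretraining corpus.
term_counts = Counter(tok for doc in pretraining_corpus for tok in doc.split())

# Pair each instance's pretraining term frequency with its observed accuracy.
freqs = [term_counts[r["term"]] for r in eval_records]
accs = [r["accuracy"] for r in eval_records]

# A positive correlation would suggest the model performs better on inputs
# whose key terms appeared more often during pretraining.
print("Pearson r:", correlation(freqs, accs))
```

In practice, term counts would be computed over the full pretraining corpus and accuracies aggregated from model outputs; the sketch only conveys the general shape of a frequency-versus-performance analysis.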

Details

Title: Evaluating Foundational Models Using Insights From Their Pretraining Data
Number of pages: 137
Publication year: 2024
Degree date: 2024
School code: 0030
Source: DAI-B 86/7(E), Dissertation Abstracts International
ISBN: 9798302855459
Committee member: Futrell, Richard; Fowlkes, Charless C.
University/institution: University of California, Irvine
Department: Computational Science
University location: United States -- California
Degree: Ph.D.
Source type: Dissertation or Thesis
Language: English
Document type: Dissertation/Thesis
Dissertation/thesis number: 31762250
ProQuest document ID: 3162747316
Document URL: https://www.proquest.com/dissertations-theses/evaluating-foundational-models-using-insights/docview/3162747316/se-2?accountid=208611
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database: ProQuest One Academic