Abstract

Modern AI workloads—such as Reinforcement Learning (RL) and Large Language Models (LLMs)—have become foundational to applications spanning robotics, autonomous systems, scientific discovery, and human-computer interaction. While they differ in learning paradigms—RL focuses on decision-making through interaction, whereas LLMs rely on statistical learning from large datasets—both classes of models increasingly demand significant compute, memory, and energy resources. As these workloads grow in scale and complexity, they begin to expose similar systems-level challenges, including inefficient memory access, limited cache utilization, and high energy overhead. Motivated by these common bottlenecks, this dissertation explores a natural progression from RL to LLMs, unifying them under the lens of hardware-software co-design for scalable and efficient AI.

Reinforcement learning and multi-agent RL (MARL) systems are widely applied in domains such as autonomous vehicles, robotics, game theory, and resource management, where they enable agents to learn optimal policies through interaction with dynamic environments. However, recent studies have shown that these workloads suffer from inefficiencies that can limit their adoption in real systems. These bottlenecks stem from the complexity of the decision-making process, which must observe and act upon a large number of environmental events during training, and from the growing number of AI agents that must interact with one another.

To understand the landscape of multi-agent systems, the first section of the dissertation conducts a detailed workload characterization on processor-centric systems to analyze the end-to-end training time, gain insights into cache efficiency, and explore the scalability aspects of online multi-agent systems. We identify a key performance bottleneck in the transition data sampling phase that dominates the overall training time of various MARL systems, and this phase is heavily influenced by irregularity in memory access patterns.

The second part of the dissertation explores opportunities for performance optimizations such as temporal locality-aware sampling, data layout reorganization, and spatial locality-aware sampling to mitigate the performance bottlenecks identified in our workload characterization. Through these optimizations, we reduce end-to-end training times for multi-agent RL workloads without sacrificing learning performance. Additionally, we shift our focus to memory-centric computing systems, specifically Processing-In-Memory (PIM) architectures, to tackle the memory bottlenecks of offline RL workloads, where learning from large transition datasets often causes frequent data transfers between caches and memory units. We target widely used algorithms like Q-learning and SARSA, found in applications such as financial trading and smart grids. By adapting these workloads to PIM architectures, we observe scalable performance and develop a multi-agent version of Q-learning optimized for hardware. This highlights the potential of PIM systems to accelerate RL training in both single- and multi-agent settings.
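To make the memory behavior concrete, the tabular Q-learning step targeted in this part can be sketched as below. This is a minimal illustration only: the state/action sizes, hyperparameters, and function names are assumptions for this example, not details taken from the dissertation.

```python
import numpy as np

# Hypothetical sizes and hyperparameters for illustration only.
N_STATES, N_ACTIONS = 16, 4
ALPHA, GAMMA = 0.1, 0.99

def q_update(Q, s, a, r, s_next):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Each update reads one row of Q (for s_next) and one cell (for s,a);
    with sampled transitions these row accesses land at scattered
    addresses, which is the kind of irregular memory pattern that
    processing-in-memory hardware can serve close to the data."""
    td_target = r + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (td_target - Q[s, a])
    return Q

Q = np.zeros((N_STATES, N_ACTIONS))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Because each update touches only one table row plus one cell, the arithmetic is cheap relative to the data movement, which is why offloading the update to compute units near memory can pay off.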

Extending this systems-level lens to language models, the final section of the dissertation examines the fine-tuning of LLMs—an increasingly critical but resource-intensive phase in model deployment. We evaluate techniques such as low-rank adaptation, activation checkpointing, and mixed-precision optimizations. These methods collectively reduce memory and energy overheads while maintaining fine-tuning performance across diverse model architectures.
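The memory saving behind low-rank adaptation can be seen in a small sketch. The dimensions and names below are hypothetical, chosen for illustration; the key point is that a frozen weight matrix W is augmented by a trainable product B·A of rank r, so only the two small factors need gradients and optimizer state.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4          # hypothetical dimensions; rank r << d

W = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01 # trainable low-rank factor
B = np.zeros((d_out, r))                  # zero-initialized: no change at start

def lora_forward(x):
    # y = W x + B (A x): only A and B are trained (2*r*d parameters)
    # instead of all d_out * d_in entries of W.
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# With B still zero, the adapted layer reproduces the frozen model exactly.
y = lora_forward(x)
```

Since the optimizer tracks state only for A and B, the fine-tuning memory footprint scales with r rather than with the full weight matrix, which is the trade-off the dissertation's evaluation measures.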

By unifying RL and LLMs under a common systems framework, this work underscores the importance of architectural awareness in AI model development. Despite their architectural differences, RL agents and LLMs exhibit common system-level bottlenecks—such as high data movement costs and compute-memory trade-offs—making a co-optimized hardware-software stack critical for the future of real-world AI deployment.

Details

Title
Performance Tuning for Next-Gen AI: System-Level Optimization and Profiling for Reinforcement Learning and Language Models Through Hardware–Software Co-Design
Author
Number of pages
176
Publication year
2025
Degree date
2025
School code
0075
Source
DAI-B 87/1(E), Dissertation Abstracts International
ISBN
9798288857201
Committee member
Lan, Tian; Wu, Nan; Wei, Peng; Kayiran, Onur
University/institution
The George Washington University
Department
Computer Engineering
University location
United States -- District of Columbia
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32164370
ProQuest document ID
3231984849
Document URL
https://www.proquest.com/dissertations-theses/performance-tuning-next-gen-ai-system-level/docview/3231984849/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic