The widespread adoption of Large Language Models (LLMs) has reshaped the datacenter computing landscape. As these models continue to grow in size and complexity, they require increasingly expensive and power-intensive infrastructure. Serving LLMs efficiently has therefore become critical for controlling costs and operating within the resource constraints of modern datacenters.
In this dissertation, I argue that serving efficiency can be significantly improved by designing systems that are aware of the two distinct phases of generative LLM inference: a compute-intensive prefill phase and a memory-intensive decode phase. Exploiting the unique properties of each phase unlocks substantial performance gains at scale.
My research validates this thesis through three studies. First, I address power constraints, a key bottleneck to datacenter growth. By analyzing how the distinct power demands of the prefill and decode phases aggregate across a cluster, I show that the provisioned power of inference clusters is underutilized. Based on this observation, I develop a power oversubscription framework that safely adds servers under existing power budgets, increasing inference cluster capacity with minimal performance impact.
Second, I show that running the compute-bound prefill and memory-bound decode phases on the same hardware leads to poor performance and resource stranding. To address these inefficiencies, I introduce a new inference cluster architecture that disaggregates the two phases onto separate hardware fleets, each specialized for the resource demands of its phase. This phase-separated cluster design yields substantial efficiency improvements over traditional approaches.
Third, I extensively analyze the unique inefficiencies caused by conditional computation in Mixture-of-Experts (MoE) models, which I formalize as the MoE tax. This tax manifests differently in each phase: for instance, it creates load imbalance during prefill and increases memory transfers during decode. Based on this analysis, I propose phase-specific optimizations that address these bottlenecks and improve the efficiency of serving MoE models at scale.
Collectively, these studies demonstrate that phase awareness is a key principle for designing efficient generative LLM serving systems.