Training large AI models is computationally intensive. State-of-the-art large language models (LLMs) and vision-language models (VLMs) often require thousands of GPUs and weeks or even months of training. As models scale to meet the demands of modern applications, efficient distributed training becomes essential, yet it remains highly complex. No single distributed training configuration (or training recipe) works across all combinations of model architecture, hardware platform, and data modality. Practitioners must therefore explore a vast configuration space through costly trial and error, often building and tuning implementations by hand; even then, out-of-memory errors and suboptimal performance are common. This complexity is compounded by the difficulty of synthesizing efficient implementations for the configurations that are selected: existing frameworks are fragmented across disparate libraries, lack interoperability, and are difficult to maintain, making the development, evaluation, and reuse of training recipes a significant engineering burden.
This thesis introduces LegoAI, a system that turns distributed AI training into an automated, scalable, and modular process. Given a model, a dataset, and a hardware configuration, LegoAI automatically selects the optimal distributed training configuration and generates a production-ready implementation that scales to thousands of GPUs. At its core, LegoAI is a synthesis engine: it decomposes state-of-the-art training strategies into modular, composable design principles and unifies them within a single coherent framework. In doing so, it exposes a vast configuration space that contains not only existing state-of-the-art algorithms but also entirely new designs beyond them. Through high-fidelity simulation, LegoAI predicts memory usage and runtime without executing any training, enabling fast and safe exploration of this space. For the configuration the simulation identifies as optimal, it then synthesizes an efficient and scalable implementation. Beyond exploring, comparing, and deploying state-of-the-art algorithms, LegoAI enables full-stack research: new training algorithms can be derived from the design space by composing existing design principles, then analyzed and synthesized within the same framework.
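The select-then-synthesize workflow above can be illustrated with a minimal sketch. All names here (`Config`, `simulate`, `search`) and the toy cost model are hypothetical placeholders, not LegoAI's actual API or simulator; the sketch only shows the shape of the search: enumerate composable parallelism configurations, predict memory and runtime without execution, discard out-of-memory candidates, and return the fastest survivor.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical sketch of simulation-driven configuration search.
# The cost model below is a toy stand-in for a high-fidelity simulator.

@dataclass(frozen=True)
class Config:
    data_parallel: int      # number of model replicas
    tensor_parallel: int    # intra-layer sharding degree
    pipeline_parallel: int  # number of pipeline stages

def simulate(cfg: Config, model_gb: float, gpu_mem_gb: float):
    """Predict (fits_in_memory, step_time) for a configuration
    without running any training (toy analytical model)."""
    shards = cfg.tensor_parallel * cfg.pipeline_parallel
    mem_per_gpu = model_gb / shards                    # parameters are sharded
    compute = 1.0 / (cfg.data_parallel * shards)       # ideal compute scaling
    comm = 0.02 * (cfg.tensor_parallel - 1)            # tensor-parallel collectives
    bubble = 0.01 * (cfg.pipeline_parallel - 1)        # pipeline bubble overhead
    return mem_per_gpu <= gpu_mem_gb, compute + comm + bubble

def search(n_gpus: int, model_gb: float, gpu_mem_gb: float) -> Config:
    """Enumerate the configuration space, simulate every candidate,
    filter out-of-memory configurations, and return the fastest one."""
    degrees = [d for d in (1, 2, 4, 8) if d <= n_gpus]
    best, best_time = None, float("inf")
    for dp, tp, pp in product(degrees, repeat=3):
        if dp * tp * pp != n_gpus:
            continue  # the three degrees must exactly tile the GPU pool
        cfg = Config(dp, tp, pp)
        fits, step_time = simulate(cfg, model_gb, gpu_mem_gb)
        if fits and step_time < best_time:
            best, best_time = cfg, step_time
    return best
```

For example, `search(8, model_gb=160.0, gpu_mem_gb=80.0)` rules out the pure data-parallel configuration (the unsharded model does not fit on one GPU) and trades off communication against pipeline bubbles among the remaining candidates. LegoAI performs the analogous search over a far richer space of composable design principles.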
We evaluate LegoAI across diverse models, GPU types (A100, H100), and interconnects (InfiniBand, RoCE), demonstrating strong scalability, accurate simulation, and effective policy synthesis. LegoAI achieves speedups of 65.08%, 12.59%, and 30% over optimized baselines on LLaMA 3.1 models at 128-, 256-, and 512-GPU scales, respectively. It predicts runtime with over 90% accuracy and memory usage with 99.9% accuracy across hardware configurations. To demonstrate LegoAI's value as a research platform, we synthesize new recomputation-based memory-efficient training algorithms that reduce overhead by up to 90% compared to baselines, while achieving superior compute–memory trade-offs: they match ILP-optimal solutions while running over 100× faster.
Thus, LegoAI is the first system to unify the synthesis, simulation, and deployment of distributed training strategies, significantly reducing cost, complexity, and uncertainty while enabling broader and more efficient exploration of the large-scale AI training design space.
