Abstract

In this paper, we present a comprehensive overview of artificial intelligence (AI) computing systems for large language model (LLM) training. The rapid advancement of LLMs in recent years, coupled with the widespread adoption of algorithms and applications such as BERT, ChatGPT, and DeepSeek, has sparked significant interest in this field. We classify LLMs into encoder-only, encoder-decoder, and decoder-only models, and briefly analyze their training and inference processes to emphasize their substantial need for computational resources. These operations depend heavily on AI-specific accelerators such as GPUs (graphics processing units), TPUs (tensor processing units), and MLUs (machine learning units). However, as the gap widens between the increasing complexity of LLMs and the current capabilities of accelerators, it becomes essential to adopt heterogeneous computing systems optimized for distributed environments to manage the growing computational and memory requirements of LLMs. We examine the execution and scheduling of LLM algorithms, underlining the critical role of distributed computing strategies, memory management enhancements, and computational efficiency improvements. This paper clarifies the complex relationship between algorithm design, hardware infrastructure, and software optimization. It provides an in-depth understanding of both the software and hardware infrastructure supporting LLM training, and offers insights into the challenges and potential avenues for future development and deployment.

Details

10000008
Business indexing term
Title
AI Computing Systems for Large Language Models Training
Volume
40
Issue
1
Pages
6-41
Publication year
2025
Publication date
Jan 2025
Publisher
Springer Nature B.V.
Place of publication
Beijing
Country of publication
Netherlands
ISSN
1000-9000
e-ISSN
1860-4749
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-03-12
Milestone dates
2024-03-11 (Registration); 2024-02-08 (Received); 2025-01-05 (Accepted)
First posting date
12 Mar 2025
ProQuest document ID
3176454874
Document URL
https://www.proquest.com/scholarly-journals/ai-computing-systems-large-language-models/docview/3176454874/se-2?accountid=208611
Copyright
Copyright Springer Nature B.V. Jan 2025
Last updated
2025-03-13
Database
ProQuest One Academic