
Abstract

The increasing demand for high-speed networks, driven by applications such as virtual reality, deep neural network training, and high-resolution video streaming, has pushed the limits of current network infrastructure. Software packet processing remains a critical bottleneck: with the stagnation of Moore’s law and the end of Dennard scaling, software speeds struggle to keep pace with the rapidly increasing line speeds of emerging network interface cards (NICs). We improve the speed of software packet processing by focusing on two key areas: (1) optimizing low-level sequential code and (2) scaling performance across cores through packet-level parallelism, while ensuring kernel safety.

This thesis focuses on packet-processing programs designed to handle heavy network traffic with fast packet turnaround times. Specifically, we are interested in applications with a “hairpin” traffic flow: receiving, quickly processing, and immediately transmitting packets at the highest possible rate. Examples of such programs include load balancers, DDoS mitigators, and network address translators. These programs have smaller compute and memory requirements than a generic network stack or endpoint applications such as TCP stacks. Such fast packet-processing programs today are developed in specialized frameworks such as DPDK (Data Plane Development Kit) and eBPF (Extended Berkeley Packet Filter), which bypass the generic code paths and functionality of the operating system in favor of software efficiency. However, reducing the role of the OS weakens the isolation guarantees (hence, the safety guarantees) that the OS provides to applications. To mitigate the lack of isolation, eBPF uses an in-kernel static checker. The checker allows a program to execute only if it can prove that the program is crash-free, always accesses memory within safe bounds, and avoids leaking kernel data.
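To make the hairpin pattern and the checker’s requirements concrete, below is a minimal illustrative sketch (not taken from the thesis) of an XDP program in restricted C: it receives a packet, swaps the Ethernet source and destination addresses, and immediately transmits it back out the same NIC. The explicit bounds check before touching the header is exactly the kind of proof obligation the in-kernel verifier enforces.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int hairpin_reflect(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    /* The verifier rejects the program unless every memory access is
     * provably within packet bounds; this check supplies that proof. */
    if ((void *)(eth + 1) > data_end)
        return XDP_DROP;

    /* Swap source and destination MAC addresses in place. */
    unsigned char tmp[ETH_ALEN];
    __builtin_memcpy(tmp, eth->h_dest, ETH_ALEN);
    __builtin_memcpy(eth->h_dest, eth->h_source, ETH_ALEN);
    __builtin_memcpy(eth->h_source, tmp, ETH_ALEN);

    /* Hairpin: transmit straight back out the receiving interface. */
    return XDP_TX;
}

char _license[] SEC("license") = "GPL";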

To achieve optimal low-level sequential code efficiency, we introduce K2 (ACM SIGCOMM 2021), a compiler that leverages stochastic program synthesis and first-order logic to generate BPF code with both optimal performance (according to a stated objective function) and kernel safety guarantees. Traditional optimizing compilers for BPF often fall short of optimal performance, since the kernel verifier’s safety constraints are incompatible with rule-based optimizations. K2 reduces code size by 6–26%, lowers packet-processing latency by 1.36–55.03%, and increases throughput by up to 4.75% compared to the best Clang-compiled programs. Systematic search via program synthesis can find better programs at the expense of additional time (K2’s average search time is 10,096 seconds); K2 incorporates several domain-specific techniques to keep synthesis practical, accelerating equivalence-checking of BPF programs by six orders of magnitude.
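The flavor of stochastic synthesis can be shown with a self-contained toy (a reconstruction in miniature, not K2’s implementation; the three-opcode ISA, cost function, and constants below are all invented for illustration). A Metropolis-style search mutates a straight-line program that computes f(x) = 4x + 2, preferring candidates that stay correct on sample inputs while using fewer instructions; the exhaustive input sweep in equivalent() stands in for K2’s first-order-logic equivalence check.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LEN 4                         /* fixed program length, NOP-padded */
enum op { NOP, INC, DBL };            /* toy straight-line ISA */

static long spec(long x) { return 4 * x + 2; }   /* behavior to preserve */

static long run(const enum op *p, long x)
{
    for (int i = 0; i < LEN; i++)
        x = (p[i] == INC) ? x + 1 : (p[i] == DBL) ? 2 * x : x;
    return x;
}

/* Cost = heavy penalty for semantic error on sample inputs + code size. */
static double cost(const enum op *p)
{
    double err = 0;
    for (long x = 0; x < 16; x++)
        err += labs(run(p, x) - spec(x));
    int insns = 0;
    for (int i = 0; i < LEN; i++)
        insns += (p[i] != NOP);
    return 1000.0 * err + insns;
}

/* Stand-in for K2's formal (first-order logic) equivalence check. */
static int equivalent(const enum op *p)
{
    for (long x = -100; x <= 100; x++)
        if (run(p, x) != spec(x))
            return 0;
    return 1;
}

int main(void)
{
    srand(1);
    enum op cur[LEN] = { DBL, DBL, INC, INC };   /* 4x+2 in four insns */
    enum op best[LEN];
    memcpy(best, cur, sizeof cur);

    for (int it = 0; it < 20000; it++) {
        enum op cand[LEN];
        memcpy(cand, cur, sizeof cur);
        cand[rand() % LEN] = (enum op)(rand() % 3);   /* random rewrite */

        double d = cost(cand) - cost(cur);
        /* Metropolis acceptance: always keep improvements, occasionally
         * accept regressions to escape local minima. */
        if (d <= 0 || (double)rand() / RAND_MAX < exp(-d))
            memcpy(cur, cand, sizeof cand);

        if (cost(cur) < cost(best) && equivalent(cur))
            memcpy(best, cur, sizeof cur);
    }
    /* Typically finds DBL, INC, DBL (+NOP): (2x+1)*2 = 4x+2, three insns. */
    printf("best cost: %.0f\n", cost(best));
    return 0;
}

In K2’s setting, candidates are BPF instruction sequences, the objective estimates latency or code size, and only candidates that pass both the equivalence check and the kernel’s safety conditions are accepted.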

To achieve better parallel scaling of packet-processing performance, we propose SCR (State-Compute Replication, NSDI 2025), a principle for distributing the processing of stateful flows (i.e., packets that read and update the same memory) across multiple CPU cores. The prevailing method to scale throughput with multiple cores is state sharding, in which all packets of a given stateful flow are processed on the same core. Given the skewed flow-size distributions of realistic traffic, however, this method is untenable: total throughput becomes limited by single-core performance. SCR instead uses a packet history sequencer that we designed to enable state updates without explicit synchronization across CPU cores, allowing throughput to scale linearly with the number of cores, independent of the flow-size distribution. Specifically, the sequencer embeds into each packet the fields of recent packets that are relevant to stateful operations, and sprays packets across cores in round-robin fashion; each core can then replicate the state computations and reconstruct the latest state before processing its packet. This simple replication works because the additional CPU work required per packet (reconstructing the latest state) is minor compared to the total per-packet CPU work.
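The mechanism can be illustrated with a small single-threaded simulation (the field layout and sequencer below are illustrative assumptions, not the paper’s API). Packets are sprayed round-robin across four simulated cores; each packet carries the lengths of the previous three packets, and each core keeps a private replica of a per-flow byte counter that it brings up to date by replaying the embedded history before applying its own packet, so no cross-core synchronization is needed.

#include <stdio.h>

#define NCORES 4
#define HIST   (NCORES - 1)

struct pkt {
    unsigned len;            /* this packet's state-relevant field */
    unsigned hist[HIST];     /* lengths of the previous HIST packets,
                                embedded by the history sequencer */
};

static unsigned long replica[NCORES];   /* per-core state replicas */

/* Per-core processing: replay embedded history, then apply own packet. */
static unsigned long process(int core, const struct pkt *p)
{
    for (int i = 0; i < HIST; i++)
        replica[core] += p->hist[i];    /* replicate other cores' updates */
    replica[core] += p->len;            /* apply this packet's update */
    return replica[core];               /* latest state, without locks */
}

int main(void)
{
    unsigned lens[12] = { 60, 1500, 40, 900, 60, 1500,
                          40, 900, 60, 1500, 40, 900 };
    unsigned hist[HIST] = { 0, 0, 0 };  /* sequencer's sliding window */

    for (int n = 0; n < 12; n++) {
        struct pkt p = { .len = lens[n] };
        for (int i = 0; i < HIST; i++)
            p.hist[i] = hist[i];

        int core = n % NCORES;          /* round-robin spraying */
        printf("pkt %2d -> core %d, flow bytes = %lu\n",
               n, core, process(core, &p));

        /* Sequencer slides its window of the last HIST packet fields. */
        for (int i = 0; i < HIST - 1; i++)
            hist[i] = hist[i + 1];
        hist[HIST - 1] = lens[n];
    }
    return 0;
}

Every core reports the exact cumulative byte count after its packet, even though no core sees more than one in every NCORES packets: the missed updates arrive as replayable history inside each packet.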

Improving software packet-processing throughput has become challenging due to the stagnation of Moore’s Law and the end of Dennard scaling. These trends have limited traditional computational performance growth, pushing system designers to explore innovative optimization techniques. This thesis demonstrates that significant improvements with safety guarantees are still achievable by leveraging domain-specific methods, novel compiler technologies, and emerging programmable hardware.

Details

Title
Accelerating Software Packet Processing for High-Speed Networks
Number of pages
151
Publication year
2025
Degree date
2025
School code
0190
Source
DAI-B 87/1(E), Dissertation Abstracts International
ISBN
9798288804519
Committee member
Martin, Richard; Zhu, He; Tahmasbi Arashloo, Mina
University/institution
Rutgers The State University of New Jersey, School of Graduate Studies
Department
Computer Science
University location
United States -- New Jersey
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
31840561
ProQuest document ID
3228976361
Document URL
https://www.proquest.com/dissertations-theses/accelerating-software-packet-processing-high/docview/3228976361/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic