gem5 is a modular computer system simulator that is widely used in computer architecture research and known for its slow simulation time and steep learning curve. This thesis examines two of its sequential CPU models, the AtomicSimpleCPU (AS CPU) and the TimingSimpleCPU (TS CPU), with two goals: (1) to provide an anatomical view of their execution flow, and (2) to present how execution time is partitioned at each layer of the simulated hardware, thereby exposing its current bottlenecks. The latter is accomplished by profiling each CPU on a selection of benchmark suites and configurations, using our new lightweight profiler with layer-level granularity. Finally, we show how gem5 interacts with the host kernel by logging its system calls.
We show that both CPUs spend a significant amount of execution time inside their instruction fetch stage and that the Ruby memory subsystem, which is used whenever a memory request is made, consumes the largest portion of the overall execution time. For the AS CPU, Ruby requests trigger a chain of function calls, whereas the TS CPU emulates timed memory accesses, introducing realistic delays and response event scheduling. For timed memory requests, Ruby uses the Garnet on-chip network model to route packets between the components of the memory hierarchy, further increasing the overhead for applications prone to frequent L1 cache misses. Even for applications that often hit in the L1 cache, we observe that Ruby’s overhead is the performance bottleneck.
While increasing the memory or the number of cores for a given configuration may intuitively appear beneficial, we find that it did not yield any significant performance benefit for either CPU. Our framework provides a foundation for optimising gem5, especially the AS and TS CPUs, and serves as a template for analysing other gem5 components or out-of-order models.