Abstract - As computer systems scale to multi-core and multi-node configurations for higher performance, the Non-Uniform Memory Access (NUMA) architecture has been widely adopted in recent systems. On NUMA architectures, typical lock primitives such as the spinlock become a major performance bottleneck because they cause frequent inter-node traffic. To improve performance on NUMA architectures, a lock algorithm must be designed carefully to minimize cache coherence traffic and cache coherence misses.
In this paper, we examine two locks designed for NUMA architectures: the Cohort Lock and the NUMA-Aware Reader-Writer Lock. We analyze their performance under various read-write workload ratios using a microbenchmark. We also evaluate various settings of two tunable parameters to improve throughput: the handoff bound in the Cohort Lock and the maximum patience limit in the NUMA-Aware Reader-Writer Lock. Our evaluation shows that throughput improves by up to 3% when appropriate values are chosen for these parameters.
Keywords: NUMA, reader-writer lock
1 Introduction
Currently, most high-performance systems use hardware platforms with two or more sockets, each containing several cores. Such multi-socket, multi-node systems adopt the Non-Uniform Memory Access (NUMA) architecture. A NUMA system is structured as two or more nodes, each consisting of several cores, a local cache, and local memory. The local cache and memory of one node can be accessed by other nodes through an interconnect; however, inter-node access is slower than intra-node access.
As the number of cores in a NUMA system grows, parallel programming becomes a key issue for performance, since many cores operate at the same time. Many applications are designed for parallelism, but their performance is limited by shared resources that must be accessed under mutual exclusion. In particular, the performance of typical lock algorithms protecting shared resources is restricted by the long access latencies of a NUMA system, which arise because inter-node traffic is slower than intra-node traffic. Therefore, exploiting the intra-node cache, reducing cache coherence traffic, and minimizing cache coherence misses are the key factors for improving the performance of a lock algorithm.
The MCS lock [3] is widely known as a scalable lock for shared resources. The MCS lock maintains a queue of waiting threads. The head of the queue is the thread that holds the lock, and a thread that tries to acquire...