1. Introduction
In today’s parallel programming, a variety of general-purpose Application Programming Interfaces (APIs) are widely used: OpenMP and OpenCL for shared memory systems including CPUs and GPUs; CUDA, OpenCL and OpenACC for GPUs; MPI for cluster systems; or combinations of these APIs such as MPI+OpenMP+CUDA and MPI+OpenCL [1]. These APIs make it possible to implement a variety of parallel applications falling into the following main paradigms: master–slave, geometric single program multiple data, pipelining, and divide-and-conquer.
At the same time, multi-core CPUs have become widespread and are present in all computer systems, both desktop and server type. For this reason, optimization of implementations of such paradigms on such hardware is of key importance nowadays, especially as such implementations can serve as templates for coding specific domain applications. Consequently, within this paper we investigate various implementations of one such popular programming pattern—master–slave—implemented with one of the leading APIs for programming parallel applications for shared memory systems, OpenMP [2].
2. Related Work
Works related to the research addressed in this paper can be associated with one of the following areas, described in more detail in subsequent subsections:
- frameworks related to or using OpenMP that target programming abstractions even easier to use or at a higher level than OpenMP itself;
- parallelization of master–slave and producer–consumer in OpenMP, including details of analyzed models and proposed implementations.
2.1. OpenMP Related Frameworks and Layers for Parallelization
SkePU is a C++ framework targeted at heterogeneous systems with multi-core CPUs and accelerators. In terms of processing structures, SkePU incorporates several skeletons such as Map, Reduce, MapReduce, MapOverlap, Scan and Call. SkePU supports back-ends such as sequential CPU, OpenMP, CUDA and OpenCL. Work [3] describes a back-end for SkePU version 2 [4] that makes it possible to schedule workload on CPU+GPU systems; better performance than the previous SkePU implementation is demonstrated. The original version of hybrid execution in SkePU 1 used StarPU [5] as a back-end. The latter makes it possible to encompass input data with codelets that can be programmed with C/C++, OpenCL and CUDA. Eager, priority and random scheduling policies, along with caching, are available. Workload can be partitioned among CPUs and accelerators either automatically or manually with a given ratio. Details of how particular skeletons are assigned are shown in [3]. In terms of modeling, autotuning is possible using a linear performance formulation, and its speed-ups were shown versus CPU-only and accelerator-only implementations.
OpenStream is an extension of OpenMP [6,7] enabling the expression of highly dynamic control and data flows between dependent and nested tasks.
Argobots [8] is a lightweight threading layer that can be used for efficient processing and for coupling high-level programming abstractions to low-level implementations. Specifically, the paper shows that the solution outperforms state-of-the-art generic lightweight threading libraries such as MassiveThreads and Qthreads. Additionally, integration of Argobots with MPI and OpenMP is presented; the OpenMP configuration yields better performance than competing solutions for an application with nested parallelism. For the MPI configuration, better opportunities regarding reduction of synchronization and latency are shown for Argobots compared to Pthreads. Similarly, better performance of Argobots vs. Pthreads is discussed for I/O operations.
PSkel [9], as a framework, targets parallelization of stencil computations in a CPU+GPU environment. As its name suggests, it uses parallel skeletons and can use NVIDIA CUDA, Intel TBB and OpenMP as back-ends. The authors presented speed-ups of the hybrid CPU+GPU version of up to 76% and 28% versus CPU-only and GPU-only codes, respectively.
Paper [10] deals with the introduction of another source layer between a program with OpenMP constructs and actual compilation. This approach translates OpenMP constructs into an intermediate layer (NthLib is used), and the authors advocate the future flexibility and ease of introducing changes in the intermediate-layer implementation.
Optimization of OpenMP code can be performed at a lower level of abstraction, even targeting specific constructs. For example, in paper [11] the authors present a method for automatic code transformation targeting optimization of arbitrarily-nested loop sequences with affine dependencies. An Integer Linear Programming (ILP) formulation is used for finding good tiling hyperplanes. The goal is optimization of locality and communication-minimized coarse-grained parallelization. The authors presented notable speed-ups over state-of-the-art research and native compilers, ranging from 1.6x to over 10x, depending on the version and benchmark. Altogether, significant gains were shown for all of the following codes: 1-d Jacobi, 2-d FDTD, 3-d Gauss-Seidel, LU decomposition and Matrix Vec Transpose.
In paper [12], the author presented a framework for automatic parallelization of divide-and-conquer processing with OpenMP. The programmer needs to provide code for key functions associated with the paradigm, i.e., data partitioning, computations and result integration. Actual mapping of computations onto threads is handled by the underlying runtime layer. The paper presents performance results for an application of parallel adaptive quadrature integration with an irregular and imbalanced corresponding processing tree. Obtained speed-ups on an Intel Xeon Phi system reached around 90 for parallelization of an irregular adaptive integration code; a benchmark version without thread management at various levels of the divide-and-conquer tree resulted in maximum speed-ups of 98.
OmpSs [13] is a task-based programming model and runtime system related to OpenMP.
Paper [17] presents PLASMA—the Parallel Linear Algebra Software for Multicore Architectures—in a version that is an OpenMP task-based implementation adopting a tile-based approach to storage, along with algorithms that operate on tiles and use OpenMP for dynamic scheduling based on tasks with dependencies and priorities. A detailed assessment of the software’s performance is presented in the paper using three platforms, with 2 × Intel Xeon CPU E5-2650 v3 CPUs at 2.3 GHz, an Intel Xeon Phi 7250 and 2 × IBM POWER8 CPUs at 3.5 GHz, respectively, using gcc, compared to MKL (for Intel) and ESSL (for IBM). PLASMA resulted in better performance for algorithms suited to its tile-based approach, such as factorization, as well as QR factorization in the case of tall and skinny matrices.
In [18], the authors presented parts of the first prototype of the sLaSs library with auto-tunable implementations of linear algebra operations. They used OmpSs with its task-based programming model and features such as weak dependencies and regions with the final clause. They benchmarked their solution using a supercomputer featuring nodes with 2 sockets with Intel Xeon Platinum 8160 CPUs, each with 24 cores and 48 logical processors. Results are shown for TRSM for the original LASs, sLaSs, PLASMA, MKL and ATLAS; for NPGETRF for LASs, sLaSs and MKL; and for NPGESV for LASs and sLaSs, demonstrating an improvement of the proposed solution of about 18% compared to LASs.
2.2. Parallelization of Master–Slave with OpenMP
Master–slave can be thought of as a paradigm that enables parallelization of processing among independently working slaves that receive input data chunks from the master and return results to it.
OpenMP by itself offers ways of implementing the master–slave paradigm, in particular using:
- #pragma omp parallel along with #pragma omp master directives, or #pragma omp parallel with master and slave codes distinguished based on thread ids;
- #pragma omp parallel with threads fetching tasks in a critical section; a counter can be used to iterate over available tasks. In [19], this is called an all slave model;
- tasking with the #pragma omp task directive;
- assignment of work through dynamic scheduling of independent iterations of a for loop.
In [19], the author presented virtually identical and almost perfectly linear speed-up of the all slave model and the (dynamic,1) loop distribution for the Mandelbrot application on 8 processors. In our case, we provide extended analysis of more implementations and many more CPU cores.
In work [20], authors proposed a way to extend OpenMP for master–slave programs that can be executed on top of a cluster of multiprocessors. A source-to-source translator translates programs that use an extended version of OpenMP into versions with calls to their runtime library. OpenMP’s API is proposed to be extended with
OpenMP will typically be used for parallelization within cluster nodes and integrated with MPI at a higher level for parallelization of master–slave computations among cluster nodes [1,21]. Such a technique should yield better performance in a cluster with multi-core CPUs than an MPI-only approach in which several processes are used as slaves, as opposed to threads within a process communicating with MPI. Furthermore, overlapping communication and computations can be used so that the master sends out data packets earlier, hiding slave idle times. Such a hybrid MPI/OpenMP scheme has been further extended in terms of dynamic behavior and malleability (ability to adapt to a changing number of processors) in [22]. Specifically, the authors implemented a solution and investigated MPI’s support for the features needed by an extended and dynamic master–slave scheme. A specific implementation called WaterGAP was used, which computes current and future water availability worldwide. It partitions the tested global region into basins of various sizes, which are forwarded to slaves for processing independent of other slaves. Speed-up is limited by the slave whose processing takes the maximum of the slaves’ times. In order to deal with load imbalance, dynamic arrival of slaves has been adopted. The master assigns the tasks by size, starting from the largest task. Good allocation results in large basins being allocated to a process with many (powerful) processors and smaller basins to a process with fewer (weaker) processors. If a more powerful (in the aforementioned sense) slave arrives, the system can reassign a large basin. Furthermore, slave processes can dynamically split into either processes or threads for parallelization. The authors concluded that MPI-2 provides the needed support for these features, apart from a scenario of sudden withdrawal of slaves in the context of proper finalization of an MPI application. No numerical results have been presented though.
In the case of OpenMP, implementations of master–slave and the producer–consumer pattern might share some elements. A buffer could be (but does not have to be) used for passing data between the master and slaves, and is naturally used in producer–consumer implementations. In master–slave, the master would typically manage several data chunks ready to be distributed among slaves, while in producer–consumer the producer or producers will typically add one data chunk at a time to a buffer. Furthermore, in the producer–consumer pattern consumers do not return results to the producer(s). In the producer–consumer model we typically consider one or more producers and one or more consumers of data chunks. Data chunk production and consumption rates might differ, in which case a limited-capacity buffer is used, into which producer(s) insert data and from which consumer(s) fetch data for processing.
Book [1] contains three implementations of the master–slave paradigm in OpenMP: the designated-master, integrated-master and tasking versions, also considered in this work. Research presented in this paper directly extends those OpenMP implementations. Specifically, the paper adds the dynamic-for version, as well as versions overlapping merging and data generation—tasking2 and dynamic-for2. Additionally, tests within this paper are run for a variety of thread affinity configurations, for various compute intensities, as well as on four multi-core CPU models of modern generations, including Kaby Lake, Coffee Lake, Broadwell and Skylake.
There have been several works focused on optimization of tasking in OpenMP that, as previously mentioned, can be used for implementation of master–slave. Specifically, in paper [23], the authors proposed extensions of the tasking and related constructs with dependencies.
3. Motivations, Application Model and Implementations
It should be emphasized that, since the master–slave processing paradigm is widespread and multi-core CPUs are present in practically all desktops and workstations/cluster nodes, it is important to investigate various implementations and determine preferred settings for such scenarios. At the same time, the processor families tested in this work are representatives of recent CPU generations in their respective lines. The contribution of this work is an experimental assessment of the performance of the proposed master–slave codes using OpenMP directives and library calls, compiled with gcc.
The model analyzed in this paper distinguishes the following conceptual steps, which are repeated:
- the master generates a predefined number of data chunks from a data source, if there is still data to be fetched from it;
- data chunks are distributed among slaves for parallel processing;
- results of individually processed data chunks are provided to the master for integration into a global result.
It should be noted that this model, assuming that the buffer size is smaller than the size of the total input data, differs from a model in which all input data is generated at once by the master. It might be especially well suited to processing streamed data, e.g., from the network, sensors or devices such as cameras and microphones.
Implementations of the Master–Slave Pattern with OpenMP
The OpenMP-based implementations of the analyzed master–slave model described in Section 3 and used for benchmarking are as follows:
designated-master (Figure 1)—direct implementation of master–slave in which a separate thread is performing the master’s tasks of input data packet generation as well as data merging upon filling in the output buffer. The other launched threads perform slaves’ tasks.
integrated-master (Figure 2)—a modified implementation of the designated-master code. The master’s tasks are moved into a slave thread. Specifically, if a consumer thread has inserted the last result into the result buffer, it merges the results into a global shared result, clears its space and generates new data packets into the input buffer. If the buffer were large enough to contain all input data, such an implementation would be similar to the all slave implementation shown in [19].
tasking (Figure 3)—code using the tasking construct. Within a region in which threads operate in parallel (created with #pragma omp parallel), one of the threads generates input data packets and launches tasks (in a loop), each of which is assigned processing of one data packet. The tasks are executed by the aforementioned threads. Upon completion of all the assigned tasks, results are merged by the one designated thread, new input data is generated and the procedure is repeated.
tasking2—this version is an evolution of tasking. It potentially allows overlapping generation of new data into the buffer with merging of the latest results into the final result by the thread that launched computational tasks in version tasking. The only difference compared to the tasking version is that data generation is executed using #pragma omp task.
dynamic-for (Figure 4)—this version is similar to the tasking one, except that instead of tasks, each iteration of the loop launches a function processing a given input data packet. Parallelization of the for loop is performed with #pragma omp for with a schedule(dynamic,1) clause. Upon completion, output is merged, new input data is generated and the procedure is repeated.
dynamic-for2 (Figure 5)—this version is an evolution of dynamic-for. It allows overlapping generation of new data into the buffer with merging of the latest results into the final result, through assignment of the two operations to threads with distinct ids (such as 0 and 4 in the listing). It should be noted that the ids of these threads can be controlled in order to make sure that they run on different physical cores, as was the case for the two systems tested in the following experiments.
For test purposes, all implementations used a buffer of 512 elements, which is a multiple of the numbers of logical processors.
4. Experiments
4.1. Parametrized Irregular Testbed Applications
The following two applications are irregular in nature, which results in varying execution times per data chunk and consequently exercises the dynamic load balancing capabilities of the tested master–slave implementations.
4.1.1. Parallel Adaptive Quadrature Numerical Integration
The first, compute-intensive, application is numerical integration of a given function. For benchmarking, integration was run over the [0, 100] range. The range was partitioned into 100,000 subranges, which were regarded as data chunks in the processing scheme. Each subrange was then integrated (by a slave) using the following adaptive quadrature [26] recursive technique for a given range [a, b] under consideration:
- if the area of the triangle over the range is smaller than a given threshold (k = 18), then the sum of the areas of the two trapezoids over the halves of the range is returned as the result;
- otherwise, recursive partitioning into the two subranges [a, (a+b)/2] and [(a+b)/2, b] is performed and the aforementioned procedure is repeated for each of these until the condition is met.
This way, increasing the partitioning coefficient increases the accuracy of computations and consequently increases the compute-to-synchronization ratio. Furthermore, this application does not require much memory and is not memory bound.
4.1.2. Parallel Image Recognition
In contrast to the previous application, parallel image recognition was used as a benchmark that requires much memory and frequent memory reads. Specifically, the goal of the application is to search for at least one occurrence of a template (sized TEMPLATEXSIZE × TEMPLATEYSIZE in pixels) within an image (sized IMAGEXSIZE × IMAGEYSIZE).
In this case, the initial image is partitioned, and within each chunk a part of the initial image of size (TEMPLATEXSIZE + BLOCKXSIZE) × (TEMPLATEYSIZE + BLOCKYSIZE) is searched for an occurrence of the template. In the actual implementation, values of IMAGEXSIZE = IMAGEYSIZE = 20,000, BLOCKXSIZE = BLOCKYSIZE = 20 and TEMPLATEXSIZE = TEMPLATEYSIZE = 500 pixels were used.
The image was initialized with every third row and every third column having pixels not matching the template. This results in earlier termination of the search for the template, depending also on the starting search location in the initial image, which results in varying search times per chunk.
In the case of this application, a compute coefficient reflects how many passes over the initial image are performed. In actual use cases this might correspond to scanning slightly updated images in a series (e.g., satellite images or images of a location taken with a drone) for objects. On the other hand, it makes it possible to simulate scenarios with various relative compute, memory access and synchronization overheads for various systems.
4.2. Testbed Environment and Methodology of Tests
Experiments were performed on two systems, typical of a modern desktop and a workstation, with specifications outlined in Table 1.
The following combinations of tests were performed: {code implementation} × {range of thread counts} × {affinity setting} × {partitioning coefficients: 1, 8, 32}. The range of thread counts tested depends on the implementation and varied as follows, based on preliminary tests that identified the most interesting values based on most promising execution times, where means the number of logical processors: for designated-master these were , , and , for all other versions the following were tested: , , and . Thread affinity settings were imposed with environment variables
4.3. Results
Since all combinations of tested configurations resulted in a very large number of execution times, we present the best results as follows. For each partitioning coefficient (for numerical integration) and compute coefficient (for image recognition), and for each code implementation, the 3 best results with a configuration description are presented in Tables 2 and 3 for numerical integration and in Tables 4 and 5 for image recognition, along with the standard deviation computed from the results. Consequently, it is possible to identify how code versions compare to each other and how configurations affect execution times.
Additionally, for the coefficients, execution times and corresponding standard deviation values are shown for various numbers of threads. These are presented in Figure 6 and Figure 7 for numerical integration as well as in Figure 8 and Figure 9 for image recognition.
4.4. Observations and Discussion
4.4.1. Performance
From the performance point of view, based on the results the following observations can be drawn and subsequently be generalized:
For numerical integration, the best implementations are tasking and dynamic-for2 (or dynamic-for for system 1), with practically identical results. These are very closely followed by tasking2 and dynamic-for, and then by the visibly slower integrated-master and designated-master.
For image recognition, the best implementations for system 1 are dynamic-for2/dynamic-for and integrated-master with very similar results, followed by tasking, designated-master and tasking2. For system 2, the best results are obtained by dynamic-for2/dynamic-for and tasking2, followed by tasking and then by the visibly slower integrated-master and designated-master.
For system 2, we can see benefits from overlapping for dynamic-for2 over dynamic-for for numerical integration and for both tasking2 over tasking, as well as dynamic-for2 over dynamic-for for image recognition. The latter is expected as those configurations operate on considerably larger data and memory access times constitute a larger part of the total execution time, compared to integration.
For the compute-intensive numerical integration example, we see that the best results were generally obtained with oversubscription, i.e., for tasking* and dynamic-for* the best numbers of threads were 64 rather than 32 for system 2 and 16 rather than 8 for system 1. These configurations apparently make it possible to mitigate idle time without the accumulated memory access cost that oversubscription can otherwise incur.
In terms of thread affinity, the best configurations were default/noprocbind for numerical integration on both systems; for the smaller compute coefficients, thrclose/corspread for system 1 and sockets for system 2; and, for compute coefficient 8, default for system 1 and noprocbind for system 2.
For image recognition, configurations generally show visibly larger standard deviation than for numerical integration, apparently due to memory access impact.
We can notice that relative performance of the two systems is slightly different for the two applications. Taking into account best configurations, for numerical integration system 2’s times are approx. 46–48% of system 1’s times while for image recognition system 2’s times are approx. 53–61% of system 1’s times, depending on partitioning and compute coefficients.
We can assess gain from HyperThreading for the two applications and the two systems (between 4 and 8 threads for system 1 and between 16 and 32 threads for system 2) as follows: for numerical integration and system 1 it is between 24.6% and 25.3% for the coefficients tested, for system 2 it is between 20.4% and 20.9%; for image recognition and system 1, it is between 10.9% and 11.3% and similarly for system 2 between 10.4% and 11.3%.
We can see that ratios of best system 2 to system 1 times for image recognition are approx. 0.61 for coefficient 2, 0.57 for coefficient 4 and 0.53 for coefficient 8 which means that results for system 2 for this application get relatively better compared to system 1’s. As outlined in Table 1, system 2 has larger cache and for subsequent passes more data can reside in the cache. This behavior can also be seen when results for 8 threads are compared—for coefficients 2 and 4 system 1 gives shorter times but for coefficient 8 system 2 is faster.
integrated-master is relatively better, compared to the best configuration, for system 1 than for system 2—in the latter case, the master’s role can be taken by any thread, running on either of the 2 CPUs.
The bottom line, taking the results into consideration, is that the preferred configurations are the tasking- and dynamic-for-based ones, with thread oversubscription (2 threads per logical processor) preferred for the compute-intensive numerical integration and 1 thread per logical processor for the memory-intensive image recognition. In terms of affinity, default/noprocbind are preferred for numerical integration on both systems; for image recognition, thrclose/corspread for system 1 and sockets for system 2 for the smaller compute coefficients, and default for system 1 and noprocbind for system 2 for compute coefficient 8.
4.4.2. Ease of Programming
Apart from the performance of the proposed implementations, ease of programming can be assessed in terms of the following aspects:
code length—the order from the shortest to the longest version of the code is as follows: tasking, dynamic-for, tasking2, integrated-master, dynamic-for2 and designated-master,
the numbers of OpenMP directives and functions. In this case the versions can be characterized as follows:
designated-master—3 directives and 13 function calls;
integrated-master—1 directive and 6 function calls;
tasking—4 directives and 0 function calls;
tasking2—6 directives and 0 function calls;
dynamic-for—7 directives and 0 function calls;
dynamic-for2—7 directives and 1 function call,
which makes tasking the most elegant and compact solution.
controlling synchronization—from the programmer’s point of view this seems more problematic than the code length, specifically how many distinct points in the thread codes need to synchronize explicitly. In this case, the easiest code to manage is tasking/tasking2, as synchronization of independently executed tasks is performed in a single thread. It is followed by integrated-master, which synchronizes with a lock in two places; dynamic-for/dynamic-for2, which require thread synchronization within #pragma omp parallel, specifically using atomics; and designated-master, which uses two locks, each in two places. This aspect potentially indicates how error-prone each of these implementations can be for a programmer.
5. Conclusions and Future Work
Within the paper, we compared six different implementations of the master–slave paradigm in OpenMP and tested the relative performance of these solutions using a typical desktop system with one multi-core CPU (an Intel i7 Kaby Lake) and a workstation system with two multi-core CPUs (Intel Xeon E5 v4 Broadwell).
Tests were performed for irregular numerical integration and irregular image recognition with three different compute intensities and for various thread affinities, compiled with the popular gcc compiler.
Future work includes investigation of aspects such as the impact of buffer length and false sharing on the overall performance of the model, as well as performing tests using other compilers and OpenMP runtime libraries.
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Figures and Tables
Table 1. Testbed configurations.
| Testbed | 1 | 2 |
|---|---|---|
| CPUs | Intel(R) Core(TM) i7-7700 CPU 3.60 GHz Kaby Lake, 8 MB cache | 2 × Intel(R) Xeon(R) CPU E5-2620 v4 2.10 GHz Broadwell, 20 MB cache per CPU |
| CPUs—total number of physical/logical processors | 4/8 | 16/32 |
| System memory size (RAM) [GB] | 16 GB | 128 GB |
| Operating system | Ubuntu 18.04.1 LTS | Ubuntu 20.04.1 LTS |
| Compiler/version | gcc version 9.3.0 (Ubuntu 9.3.0-11ubuntu0 18.04.1) | gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1 20.04) |
Table 2. Numerical integration—system 1 results.
| Part. Coeff. | Version | Time 1/std dev/Affinity/Number of Threads | Time 2/std dev/Affinity/Number of Threads | Time 3/std dev/Affinity/Number of Threads |
|---|---|---|---|---|
| 1 | integrated-master | 18.555/0.062/thrclose/8 | 18.610/0.068/corspread/8 | 18.641/0.086/corclose/8 |
| | designated-master | 21.088/0.078/corspread/8 | 21.098/0.090/default/8 | 21.106/0.096/noprocbind/8 |
| | tasking | 18.363/0.047/noprocbind/16 | 18.416/0.094/default/16 | 18.595/0.071/thrclose/8 |
| | tasking2 | 18.394/0.088/noprocbind/16 | 18.411/0.093/default/16 | 18.654/0.092/thrclose/8 |
| | dynamic-for | 18.389/0.079/default/16 | 18.428/0.105/noprocbind/16 | 18.554/0.073/corclose/16 |
| | dynamic-for2 | 18.399/0.093/default/16 | 18.416/0.101/noprocbind/16 | 18.572/0.073/corclose/16 |
| 8 | integrated-master | 27.333/0.105/thrclose/8 | 27.341/0.106/default/8 | 27.373/0.084/noprocbind/8 |
| | designated-master | 30.885/0.102/corspread/8 | 30.898/0.111/thrclose/8 | 30.956/0.130/noprocbind/8 |
| | tasking | 26.844/0.081/default/16 | 26.898/0.146/noprocbind/16 | 27.325/0.116/default/8 |
| | tasking2 | 26.865/0.131/default/16 | 26.901/0.161/noprocbind/16 | 27.299/0.134/thrclose/8 |
| | dynamic-for | 26.865/0.105/default/16 | 26.899/0.155/noprocbind/16 | 27.217/0.121/corspread/16 |
| | dynamic-for2 | 26.830/0.073/noprocbind/16 | 26.930/0.158/default/16 | 27.204/0.115/corclose/16 |
| 32 | integrated-master | 34.492/0.157/thrclose/8 | 34.526/0.137/corspread/8 | 34.555/0.202/corclose/8 |
| | designated-master | 39.005/0.149/thrclose/8 | 39.015/0.199/corclose/8 | 39.039/0.174/default/8 |
| | tasking | 33.816/0.151/noprocbind/16 | 33.889/0.333/default/16 | 34.356/0.149/noprocbind/8 |
| | tasking2 | 33.828/0.174/noprocbind/16 | 33.838/0.152/default/16 | 34.340/0.148/thrclose/8 |
| | dynamic-for | 33.781/0.165/noprocbind/16 | 33.808/0.148/default/16 | 34.354/0.165/thrclose/16 |
| | dynamic-for2 | 33.826/0.180/noprocbind/16 | 33.860/0.155/default/16 | 34.260/0.127/corclose/16 |
Table 3. Numerical integration—system 2 results.
| Part. Coeff. | Version | Time 1/std dev/Affinity/Number of Threads | Time 2/std dev/Affinity/Number of Threads | Time 3/std dev/Affinity/Number of Threads |
|---|---|---|---|---|
| 1 | integrated-master | 9.158/0.117/corspread/32 | 9.201/0.145/thrclose/32 | 9.214/0.217/sockets/32 |
| | designated-master | 9.585/0.149/corclose/33 | 9.601/0.197/thrclose/33 | 9.638/0.122/default/33 |
| | tasking | 8.567/0.017/default/64 | 8.585/0.027/noprocbind/64 | 8.664/0.025/sockets/64 |
| | tasking2 | 8.599/0.033/default/64 | 8.602/0.025/noprocbind/64 | 8.677/0.026/sockets/64 |
| | dynamic-for | 8.584/0.025/noprocbind/64 | 8.584/0.032/default/64 | 8.649/0.024/sockets/64 |
| | dynamic-for2 | 8.570/0.024/default/64 | 8.573/0.021/noprocbind/64 | 8.636/0.024/sockets/64 |
| 8 | integrated-master | 13.718/0.127/corclose/32 | 13.748/0.182/corspread/32 | 13.770/0.111/default/32 |
| | designated-master | 14.402/0.105/corclose/33 | 14.447/0.529/thrclose/32 | 14.481/0.677/sockets/32 |
| | tasking | 12.724/0.034/default/64 | 12.727/0.040/noprocbind/64 | 12.776/0.038/sockets/64 |
| | tasking2 | 12.749/0.044/default/64 | 12.771/0.035/noprocbind/64 | 12.796/0.044/sockets/64 |
| | dynamic-for | 12.792/0.041/default/64 | 12.796/0.033/noprocbind/64 | 12.845/0.031/sockets/64 |
| | dynamic-for2 | 12.731/0.031/default/64 | 12.753/0.040/noprocbind/64 | 12.811/0.047/sockets/64 |
| 32 | integrated-master | 17.471/0.080/corspread/32 | 17.486/0.105/corclose/32 | 17.551/0.152/thrclose/32 |
| | designated-master | 18.359/0.839/corspread/32 | 18.423/0.447/default/32 | 18.431/0.205/corclose/33 |
| | tasking | 16.116/0.051/noprocbind/64 | 16.120/0.055/sockets/64 | 16.175/0.420/default/64 |
| | tasking2 | 16.119/0.039/default/64 | 16.142/0.062/noprocbind/64 | 16.157/0.042/sockets/64 |
| | dynamic-for | 16.181/0.049/default/64 | 16.210/0.043/noprocbind/64 | 16.228/0.046/sockets/64 |
| | dynamic-for2 | 16.116/0.025/default/64 | 16.119/0.043/noprocbind/64 | 16.152/0.038/sockets/64 |
Image recognition—system 1 results.
| Comp. Coeff. | Version | Time 1/std dev/Affinity/Number of Threads | Time 2/std dev/Affinity/Number of Threads | Time 3/std dev/Affinity/Number of Threads |
|---|---|---|---|---|
| 2 | integrated-master | 9.530/0.210/noprocbind/8 | 9.561/0.173/default/8 | 9.578/0.183/thrclose/8 |
| | designated-master | 10.388/0.125/thrclose/8 | 10.434/0.179/noprocbind/8 | 10.450/0.222/default/8 |
| | tasking | 9.576/0.175/default/8 | 9.622/0.166/noprocbind/8 | 9.697/0.188/corclose/8 |
| | tasking2 | 12.762/0.059/noprocbind/8 | 12.777/0.093/thrclose/8 | 12.782/0.081/default/8 |
| | dynamic-for | 9.389/0.131/thrclose/8 | 9.392/0.156/noprocbind/8 | 9.403/0.151/default/8 |
| | dynamic-for2 | 9.378/0.135/thrclose/8 | 9.395/0.165/default/8 | 9.446/0.176/default/16 |
| 4 | integrated-master | 18.406/0.297/noprocbind/8 | 18.428/0.329/corclose/8 | 18.492/0.352/default/8 |
| | designated-master | 20.175/0.196/corspread/8 | 20.219/0.305/default/8 | 20.404/0.367/noprocbind/8 |
| | tasking | 18.505/0.428/noprocbind/8 | 18.514/0.308/thrclose/8 | 18.540/0.353/default/8 |
| | tasking2 | 24.935/0.154/noprocbind/8 | 24.940/0.150/corspread/8 | 24.967/0.244/thrclose/8 |
| | dynamic-for | 18.332/0.264/noprocbind/8 | 18.332/0.475/default/8 | 18.405/0.442/corspread/8 |
| | dynamic-for2 | 18.282/0.229/corspread/8 | 18.318/0.407/thrclose/8 | 18.367/0.408/default/8 |
| 8 | integrated-master | 35.995/0.678/noprocbind/8 | 36.096/0.726/default/8 | 36.282/0.612/thrclose/8 |
| | designated-master | 39.969/0.526/default/8 | 40.120/0.595/corclose/8 | 40.163/0.623/thrclose/8 |
| | tasking | 36.223/0.718/noprocbind/8 | 36.307/0.691/corspread/8 | 36.372/0.664/thrclose/8 |
| | tasking2 | 49.418/0.225/default/8 | 49.438/0.411/noprocbind/8 | 49.444/0.326/corspread/8 |
| | dynamic-for | 35.852/0.503/default/8 | 36.018/0.596/corspread/16 | 36.129/0.597/noprocbind/16 |
| | dynamic-for2 | 35.969/0.462/thrclose/8 | 36.099/0.669/default/8 | 36.190/0.675/noprocbind/8 |
Image recognition—system 2 results.
| Comp. Coeff. | Version | Time 1/std dev/Affinity/Number of Threads | Time 2/std dev/Affinity/Number of Threads | Time 3/std dev/Affinity/Number of Threads |
|---|---|---|---|---|
| 2 | integrated-master | 6.406/0.321/thrclose/32 | 6.880/0.918/default/32 | 7.002/0.738/corclose/32 |
| | designated-master | 6.283/0.311/sockets/33 | 6.644/0.364/noprocbind/33 | 6.697/0.463/default/33 |
| | tasking | 6.164/0.145/corclose/64 | 6.223/0.181/corspread/64 | 6.249/0.117/sockets/64 |
| | tasking2 | 5.981/0.208/corclose/64 | 5.995/0.165/corspread/64 | 5.997/0.067/sockets/64 |
| | dynamic-for | 5.705/0.208/default/32 | 5.722/0.105/sockets/64 | 5.739/0.088/corclose/32 |
| | dynamic-for2 | 5.682/0.072/sockets/32 | 5.697/0.055/noprocbind/32 | 5.709/0.099/default/32 |
| 4 | integrated-master | 11.583/0.564/noprocbind/32 | 11.661/0.218/sockets/32 | 11.716/0.420/corclose/32 |
| | designated-master | 11.808/1.572/corclose/32 | 11.857/1.097/sockets/33 | 11.878/0.803/noprocbind/33 |
| | tasking | 10.848/0.085/default/32 | 10.889/0.128/corclose/32 | 10.903/0.142/sockets/32 |
| | tasking2 | 10.460/0.141/sockets/32 | 10.472/0.145/corspread/32 | 10.485/0.170/default/32 |
| | dynamic-for | 10.625/0.140/default/32 | 10.629/0.133/corclose/32 | 10.635/0.161/noprocbind/32 |
| | dynamic-for2 | 10.585/0.150/sockets/32 | 10.598/0.100/noprocbind/32 | 10.610/0.140/default/32 |
| 8 | integrated-master | 20.556/0.620/noprocbind/32 | 20.595/0.708/corclose/32 | 20.738/0.861/corspread/32 |
| | designated-master | 20.705/0.836/default/33 | 20.924/4.271/sockets/32 | 21.224/0.987/noprocbind/33 |
| | tasking | 20.014/0.197/sockets/32 | 20.054/0.201/corclose/32 | 20.076/0.235/corspread/32 |
| | tasking2 | 19.120/0.305/noprocbind/32 | 19.152/0.187/sockets/32 | 19.240/0.292/corspread/32 |
| | dynamic-for | 19.758/0.193/default/32 | 19.825/0.210/thrclose/32 | 19.828/0.219/corspread/32 |
| | dynamic-for2 | 19.816/0.229/noprocbind/32 | 19.828/0.249/default/32 | 19.863/0.256/thrclose/32 |
© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The paper investigates various implementations of the master–slave paradigm using the popular OpenMP API and compares their relative performance on modern multi-core workstation CPUs. It is assumed that a master partitions the available input into a batch of a predefined number of data chunks, which are then processed in parallel by a set of slaves, and the procedure is repeated until all input data has been processed. The paper experimentally assesses the performance of six implementations, using OpenMP locks, the tasking construct and a dynamically partitioned for loop, without and with overlapping of result merging and data generation, compiled with gcc. Two distinct parallel applications are tested, each using the six aforementioned implementations, on two systems representing desktop and workstation environments: one with an Intel i7-7700 3.60 GHz Kaby Lake CPU and eight logical processors, and the other with two Intel Xeon E5-2620 v4 2.10 GHz Broadwell CPUs and 32 logical processors. From the application point of view, irregular adaptive quadrature numerical integration, as well as finding a region of interest within an irregular image, are tested. Various compute intensities are investigated by setting the computing accuracy per subrange and the number of image passes, respectively. The results allow programmers to assess which solution and which configuration settings, such as the number of threads and thread affinity, should be preferred.