Abstract: Graphics Processing Units (GPUs) are widely regarded as powerful computational resources. General-purpose computing on GPU (GPGPU) has become the de facto infrastructure for many of today's computationally intensive problems that researchers around the globe deal with. High Performance Computing (HPC) facilities use state-of-the-art GPUs, and many domains, such as deep learning, machine learning, and computational finance, use GPUs to decrease execution time. GPUs are also widely used in data centers for high-performance computing, where virtualization techniques aim to optimize resource utilization (e.g. GPU cloud computing). The GPU programming model requires all data to be stored in global memory before it is used, which limits the size of the problem a single GPU can handle. A system using a cluster of GPUs would offer a higher degree of parallelism and would also mitigate the memory limitation imposed by a single GPU. These are just a few of the issues a programmer needs to handle. However, the proportion of programmers able to program such processors efficiently is very small. One important reason for this situation is the steepness of the GPU programming learning curve, due to the complex parallel architecture of the processor. Therefore, the tool presented in this article aims to provide visual support for a better understanding of execution on the GPU. With it, programmers can easily observe the trace of the parallel execution of their own algorithm and, from that, determine the unused GPU capacity that could be better exploited.
Keywords: GPU architecture; GPU programming; GPGPU.
I. INTRODUCTION
Much of the architectural complexity of a computational system is kept hidden from programmers (the term transparent is often used). Integrated development environments (IDEs), with complex compilers and simulators, try to speed up the process of developing applications. Time to market is the main driver for such an approach. On the other hand, the same principle applies to the IDEs themselves, so we may assume that they are not fully optimized either. Others may argue that such optimizations are not required, weighing the benefits they provide against the cost of development. One exception could be battery-powered devices, which require increased autonomy. Beyond this, there is another reason why it is important to understand how complex processing architectures work, why they have their specific design, and how they can be exploited at their peak. Rather than commenting on the importance of interdisciplinary education, we present a short analogy to underline why students should assimilate complex heterogeneous computing architectures.
There is a strong similarity between how manufacturing resources are allocated in factories and how processing resources are managed in computers. This is one of the reasons why many analogies are made between manufacturing production and computing on complex processing architectures. Terms like pipeline, bottleneck, and capacity are used in both domains, and ideas such as increasing throughput with pipelines or eliminating bottlenecks by adding parallelism are common to both. In the manufacturing field, dimensioning capacity requires large investments; therefore, such investments are made in fewer places, and there is less demand for designer jobs. Additionally, to make these investments more attractive, flexibility was added to expensive manufacturing equipment (e.g. robots, flexible manufacturing cells). This can be seen as resembling the software concept applied to fixed hardware. Continuing this reasoning, we can imagine that demand will increase in the future for jobs that require programming robots or other complex architectures. Simulators will be the starting point for training personnel, but it is important to keep in mind the overall picture of the entire manufacturing plant. Further, we consider that heterogeneous computational architectures resemble closely enough the flow complexity found in manufacturing plants. Therefore, we advocate a good understanding of complex processing architectures among computer science students.
The study of computing architecture is omnipresent in computer science curricula, but there are not many skilled graduates in this domain. As stated before, we consider that the main reason is not a lack of demand for these skills on the labour market, but the difficulty students have in assimilating them. The framework described in this article comes as a step forward in overcoming this difficulty. We support this approach with the experience gained from evaluating the framework in a course entitled "Parallel processing for multimedia application", taught in the first year of a master program ("Complex signal processing for multimedia application") within the Faculty of Automatic Control and Computer Science of the University Politehnica of Bucharest.
II. RELATED WORK
Parallel programming and understanding the way multi- and many-core architectures work have become essential components in the education of future specialists. At the Faculty of Automatic Control and Computer Science, in all specializations, there is a continuing concern for adapting and improving this learning direction [1].
There are many challenges in understanding parallel execution, and even more in understanding combined CPU and GPU programming, where the two processors run simultaneously and alternately. Graphical simulation of the execution process is a method that allows students to understand what lies behind parallel execution. There are two ways to do this: using the tools provided by hardware and development vendors, such as NVIDIA Nsight Systems or Intel System Analyzer and Platform Analyzer, or using various tools developed privately for didactic or research purposes. NVIDIA Nsight Systems is a graphical performance analysis tool designed specifically to enable software optimization. It can expose blocking, CPU and GPU parallelism, inefficient use of hardware resources, and unnecessary synchronization operations; it is available for NVIDIA platforms [2]. Intel System Analyzer and Platform Analyzer are tools for analysing OpenCL applications by combining the observation of CPU and GPU load with the application's execution and the API functions it calls. Using these tools, one can see the execution queues of the processors in the system and the evolution of the application's execution; they are primarily designed for Intel's hardware [3]. These commercial tools are very powerful for development and application optimization, but they are too complex for a student at the beginning of the road. Additionally, being designed only for the manufacturer's hardware may be a limitation for the learning process. Therefore, there are many attempts to develop custom tools dedicated to understanding and learning how to program multi- and many-core systems.
Starting from the simulation of distributed systems [4] and going up to the internal simulation of multi- and many-core systems [5] [6], there are several attempts to develop educational systems, independent of any particular hardware or software manufacturer, that can be used to understand how this type of system functions. As shown in [7], one of the most important aspects of such a tool is an intuitive user interface, which plays a significant role in understanding parallel execution on the CPU and GPU.
III. PROPOSED MODULES
We propose a learning model composed of two modules: a code module and a visual module. In each module, the student's contributions or changes are kept to a minimum, so that the student can focus on understanding how an algorithm runs on a GPGPU device.
The code module was written in C++ with two main libraries, OpenCL and Boost. OpenCL is an open industry standard for programming a heterogeneous collection of CPUs, GPUs and other discrete computing devices organized into a single platform. It is more than a language: OpenCL is a framework for parallel programming and includes a language, an API, libraries, and a runtime system to support software development. Using OpenCL, for example, a programmer can write general-purpose programs that execute on GPUs without the need to map their algorithms onto a 3D graphics API such as OpenGL or DirectX. The target audience of OpenCL is expert programmers who want to write portable yet efficient code, including library writers, middleware vendors, and performance-oriented application programmers. Therefore, OpenCL provides a low-level hardware abstraction plus a framework to support programming, and many details of the underlying hardware are exposed [8].
Because this is a very complex interface for a typical student to use and understand, we wrapped this library into a framework that allows programmers to add or change small parameters, such as the algorithm that is run on the GPU, the input/output data, and the degree of parallelisation of the algorithm on the GPU.
3.1 Installation process
The student needs to have the following tools installed on the local computer: Visual Studio 2017 and CodeXL for debugging on the GPU. The project comes packaged as a zip archive with all the necessary components; the student only needs to extract the folder from the zip and copy the project folder to a desired location on the PC. Because we use Visual Studio, the only supported operating system is Windows.
In the project folder the student will see three folders:
* OpenCL, which contains all the necessary OpenCL drivers for all vendors, as well as the latest version of the OpenCL headers
* common_opencl, which contains the OpenCL framework we designed
* test_app, in which the student will make all the changes
3.2 First run
To start the solution, the student double-clicks gpu_learning.sln so that Visual Studio can load the project. The main project is test_app; on closer inspection, the student will find three files:
* OpenCL.cl, which contains the algorithm that is going to run on the GPU; it comes with an example already written in it (matrix multiplication)
* matrix_multiplication.h, which contains the allocation and generation of the input data, the allocation of resources, and the retrieval and processing of the output results for a matrix multiplication setup
* app.cpp, which contains the initialisation of the log file and the execution of the matrix multiplication example
When running the program for the first time, the student can see the following information in the console output window: information about the GPU or GPUs installed on the system (platform information, vendor information, compute units, work-group size, global memory size, local memory size, etc., as seen in Figure 1).
After this, the student can see the steps executed in preparation for running the algorithm on the GPU, the input data, and the allocation of GPU resources (Figure 2).
The last piece of information, which is the most important one for the visualization module, shows the thread ID on the x and y axes, the calculated result, the thread execution order, and the execution time of the operation (Figure 3).
3.3 Student interactions
The first task of the student is to start changing the input data, the allocation of resources, and the processing of the output result. For this, the student needs to open the matrix_multiplication.h file, in which they will find five methods (Figure 4):
* initialise_values - this method is used for creating the initial input data (the data, the input and output types, and all the constants used in the kernel)
* set_work_group_size - this is a more complex method, as the student needs to implement the following requirements:
o check whether the device has all the necessary resources for the current application
o set the resources that are available
* main_function - calls all the methods for execution on the GPU (the student does not need to change or interact with this method)
* get_outputs - this method extracts the results from the GPU, loads them into local RAM, and analyses them
* clean_up - performs a clean-up of all the allocated GPU resources (the student does not need to change or interact with this method)
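To make the structure of these five methods concrete, here is a minimal CPU-only sketch. The method names come from the description above; the bodies are simplified stand-ins (a plain CPU matrix multiply instead of real OpenCL buffer management), so this is an illustration of the flow, not the framework's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// CPU-only skeleton mirroring the five methods of matrix_multiplication.h.
struct MatrixMultiplicationSetup {
    std::size_t width = 0;
    std::vector<float> input_a, input_b, output;

    // initialise_values: create the input data and the constants the kernel uses.
    void initialise_values(std::size_t w) {
        width = w;
        input_a.assign(w * w, 1.0f);   // example data: all ones
        input_b.assign(w * w, 2.0f);   // example data: all twos
        output.assign(w * w, 0.0f);
    }

    // set_work_group_size: on the GPU this checks device limits and picks the
    // work-group sizes; here we only validate that the problem is non-empty.
    bool set_work_group_size() { return width > 0; }

    // main_function: drives the execution (a plain CPU multiply in this sketch).
    void main_function() {
        for (std::size_t y = 0; y < width; ++y)
            for (std::size_t x = 0; x < width; ++x) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < width; ++k)
                    acc += input_a[y * width + k] * input_b[k * width + x];
                output[y * width + x] = acc;
            }
    }

    // get_outputs: on the GPU this copies results back to RAM; here the
    // results are already in local memory, so we just expose them.
    const std::vector<float>& get_outputs() const { return output; }

    // clean_up: release the allocated resources.
    void clean_up() { input_a.clear(); input_b.clear(); output.clear(); }
};
```

The call order matches the framework's flow: initialise_values, set_work_group_size, main_function, get_outputs, clean_up.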
In the second step, the student will start modifying the OpenCL.cl file, changing the type of the input and output data and the order in which the matrices are multiplied (Figure 5).
The first thing the student will notice is the number of parameters in the method's signature: the first two are the input matrices, the third is the result of the multiplication, the fourth and the fifth are the thread IDs, the sixth is the counter for the atomic operation [9] used to obtain the thread execution order, and the last one is the width of the matrix.
Because our example uses two-dimensional data, we use the parallelization power of the GPU to access different data on different threads. The global work-item ID specifies the work-item ID based on the number of global work-items specified to execute the kernel; valid values of dimindx are 0 to get_work_dim() - 1, and for other values get_global_id() returns 0 [10]. The matrices are square so that the multiplication algorithm is easy to understand and change.
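As an illustration of how a 2D global ID selects one output element, here is a CPU sketch of the per-work-item computation. The parameters gx and gy play the roles of get_global_id(0) and get_global_id(1); on the GPU one work-item runs this body per element, whereas a CPU test would loop over all (gx, gy) pairs. The function name is ours, chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// What one work-item computes for output element (gx, gy) of a square
// matrix multiplication, with matrices stored row-major in flat vectors.
float multiply_one_element(const std::vector<float>& a,
                           const std::vector<float>& b,
                           std::size_t gx, std::size_t gy,
                           std::size_t width) {
    float acc = 0.0f;
    // Row gy of A times column gx of B.
    for (std::size_t k = 0; k < width; ++k)
        acc += a[gy * width + k] * b[k * width + gx];
    return acc;
}
```

For a 4x4 launch, 16 work-items each execute this once, which is exactly the degree of parallelism the simulation later visualizes.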
Atomic operations may sometimes be used for synchronization between work-items (for interdependent work). In our case we use atomic_inc because we need to record, in the thread ID vectors, the order of thread execution.
Because an OpenCL kernel does not return a value, the output buffers are also passed as parameters in the kernel signature; these are populated at the end of the kernel function.
After running the console app, the student will find a log file with all the above information. This file is used as input data for the visualization module.
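A reader of the log file only needs to recover, per thread, the two thread IDs, the computed result, and the execution order. The exact log format below is an assumption for illustration (one whitespace-separated record per line); the real framework's format may differ, and the Unity module would do the equivalent parsing in C#.

```cpp
#include <sstream>
#include <string>

// One record from the (hypothetical) log: thread IDs on x and y,
// the computed result, and the thread's execution-order index.
struct ThreadRecord {
    int id_x = 0, id_y = 0, order = 0;
    float result = 0.0f;
};

// Parses a line such as "2 3 8.0 14"; returns false on malformed input.
bool parse_thread_record(const std::string& line, ThreadRecord& rec) {
    std::istringstream in(line);
    return static_cast<bool>(in >> rec.id_x >> rec.id_y >> rec.result >> rec.order);
}
```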
For the visualization module, the student needs to install Unity. Unity is a cross-platform game engine developed by Unity Technologies, primarily used to develop 2D and 3D video games. Unity has a powerful physics engine and a GUI-based integrated development environment with support for scripting in C#. This way, the student can very easily add new objects, change or add properties of existing objects, and so on. These features are very helpful for developing a simulation in which the student can see and interact with GPU operations.
By default, the Unity project is already configured for running the simulation; the only remaining step is to run the console application first, so that it generates the log file containing all the thread information and results (Figure 6).
The simulation displays the two blue matrices and the 16 threads (each thread is represented together with 1 scalar unit and 4 vector units, the L1 cache, the Local Data Share (LDS), the L2 cache for the global memory, and the global memory), plus two buttons for stepping forwards and backwards through the data and the threads executed on the GPU. On each click, the result matrix is populated with the result, at the location in the matrix corresponding to the thread that calculated it. If the user clicks on one of the results, they will see the thread that actually calculated it and the type of memory used.
These modules were tested with the help of students, who were very eager to learn and understand how a GPU works when handling computing problems. By the end of the class, the students were able to write optimized code for different problems and also to propose improvements to the base code, so that the process will be easier and more straightforward in the future.
IV. CONCLUSIONS
After the first use of this framework among students, we have noticed several positive aspects. Firstly, they are able to complete their course projects, which is our main objective, considering that this is a means of evaluating their abilities. Secondly, they seem to be more involved during the laboratory class. This can be seen in that they understand what they are doing, the reasoning behind their work, and how it can help them in the future, and, not least, they see progress in what they are learning. Another positive aspect is that students are starting to port (implement) some of the application ideas they were working on to the GPU architecture they have learned. The framework is still a work in progress, and the way it was received by the students motivated the writing of this article.
Reference Text and Citations
[1] M. Carabas, A. Draghici, G. Lupescu, C.-G. Samoila, E.-I. Slusanschi, Integrating Parallel Computing in the Curriculum of the University Politehnica of Bucharest, Euro-Par 2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers, DOI: 10.1007/978-3-030-10549-5_18
[2] NVIDIA Nsight Systems https://developer.nvidia.com/nsight-systems (accessed on February 2, 2019)
[3] Profiling OpenCL™ Applications with System Analyzer and Platform Analyzer https://software.intel.com/sites/default/files/managed/1d/80/Intel-Profiling-OpenCL-Applications-with-SystemAnalyzer-and-Platform-Analyzer.pdf (accessed on February 2, 2019)
[4] S. Lin, X. Cheng, J. Lv, A Visualized Parallel Network Simulator for Modeling Large-scale Distributed Applications, Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies 2007 IEEE, DOI: 10.1109/PDCAT.2007.35
[5] S. Bultrowicz, P. Czarnul, P. Rosciszewski, Runtime Visualization of Application Progress and Monitoring of a GPU-enabled Parallel Environment, Applications of Information Systems in Engineering and Bioscience, pp. 70-79, ISBN: 978-960-474-381-0
[6] Joel C. Adams, P. A. Crain, M. B. Vander Stel, A Thread Safe Graphics Library For Visualizing Parallelism, ICCS 2015 International Conference On Computational Science, Volume 51, pp.1986-1995
[7] P. J. Zeno, Visualization Tool for GPGPU Programming, ASEE 2014 Zone I Conference, April 3-5, 2014, University of Bridgeport, Bridgeport, CT, USA
[8] Khronos Group, OpenCL, https://www.khronos.org/opencl/
[9] Khronos Group, OpenCL 1.2 Reference Pages: Atomic Functions, https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/atomicFunctions.html
[10] Khronos Group, OpenCL 1.0 Reference Pages: get_global_id, https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_global_id.html
[11] Wikipedia, Compute kernel, https://en.wikipedia.org/wiki/Compute_kernel
Copyright "Carol I" National Defence University 2019