ABSTRACT
In astrophysical numerical computation, the demand for greater computing power is constant. The time-consuming nature of these calculations, particularly in the domain of solar convection, makes it difficult for astrophysicists to analyze new data efficiently. Parallel algorithms are an effective way to accelerate such workloads because they allow independent portions of the data to be processed concurrently. This study examines how both multi-core processors and GPUs can be used to parallelize numerical calculations of solar convection. The primary goal is to reduce processing time so that new data can be analyzed more quickly and without long runtimes. Parallel execution performs well, especially when GPUs are applied to three-dimensional problems, where it yields substantial speedups; this demonstrates the importance of adapting the parallelization strategy to the problem size. For two-dimensional calculations, by contrast, multi-core processors perform better. The results not only address shortcomings in models of solar convection but also show that performance varies with the hardware and the processing approach.
Introduction
Parallel processing is widely used because of the ever-growing demand for computational capacity, especially in astrophysical calculations [1]. Given the enormous number of operations involved, parallel methods are an effective way to accelerate research on solar convection, since they allow many processes to run simultaneously [2]. The constant search for faster and more capable computers has led many organizations to adopt parallel processing as an alternative [3, 4]. This shift is especially visible in the computationally demanding field of astrophysics, and in solar convection in particular [5]. New approaches are needed to cope with the time-consuming nature of these calculations [6].
Parallel algorithms can perform many operations at once, which makes them well suited to accelerating computation [7]. Numerical calculations in astrophysics are challenging because they are long-running and operate on large volumes of data [8], and this is especially true of solar convection. Parallelizing these calculations is therefore a promising way to save computation time [9], since it allows independent data-processing steps to run concurrently. This makes solar convection calculations both faster and easier to reason about, and it also sheds light on the broader role of parallelization in high-performance computing [10]. By studying these methods, scientists can advance their fields in new ways and gain the tools needed to cope with the enormous volumes of data now produced in astronomy [11, 12].
This study carefully examines the practical use of parallel methods to reduce the processing load of numerical solar convection calculations. It aims to exploit the benefits of parallel processing across both multi-core processors and Graphics Processing Units (GPUs), with a focus on real-world applications. The goal of this coordinated effort is to cut computation times substantially, increasing efficiency and accelerating scientific discovery in astrophysical computing. The results offer insight into how tailored parallelization techniques behave in practice; these insights, which span a range of parallel processing methods, can serve as a basis for further optimization. The study identifies specific strategies that improve the performance of astrophysical workloads, allowing finer tuning of computational methods. Beyond its immediate objectives, it also provides a broader understanding of how parallel algorithms are used across a diverse range of hardware platforms, not only multi-core processors and GPUs, enriching our collective understanding of the evolving landscape of high-performance computing for astrophysics. The main contributions are as follows:
- Investigating parallel algorithms to reduce computational burden in solar convection calculations, enhancing astrophysical computation efficiency.
- Utilizing multi-core processors and GPUs for significant computation time reductions, expediting astrophysical insights.
- Providing insights into parallelization strategies, laying foundations for optimization and computational advancements.
- Contributing to understanding parallel algorithm use across hardware platforms, enriching high-performance computing awareness in astrophysics.
Our paper comprises several distinct sections: Section 2 reviews related work. In Section 3, we delve into the system model. Section 4 presents convection and its calculation. Following this, Section 5 covers the assessment and analysis. Lastly, in Section 6, we summarize our findings and outline future research directions.
Related Work
Studying previous research on parallel solutions in astrophysical computing, particularly work that uses multi-core processors and GPUs to improve solar convection analysis, is essential for building on established methodologies and identifying potential improvements. A thorough understanding of prior work not only helps researchers avoid redundant effort but also provides valuable insight into the evolving landscape of parallel computing, facilitating the development of more efficient and advanced solutions for analyzing solar convection phenomena. In this vein, Yang et al. [13] focused on improving the efficiency of numerical computations by optimizing multi-grid methods and parallelization techniques tailored specifically to multi-core processors. They investigated strategies for harnessing the computational power of modern multi-core architectures in order to accelerate the solution of complex problems such as numerical simulations, examining multi-grid processing and parallelization in detail to improve speed, scalability, and resource usage, with the aim of reducing solution times and potentially improving energy efficiency.
Jermyn et al. [14] worked on building and improving the Modules for Experiments in Stellar Astrophysics (MESA) framework. They developed components addressing specific aspects of stellar astrophysics modeling, such as time-dependent convection, energy conservation, and automatic differentiation, which helped make stellar evolution models more accurate and reliable. They also discussed improvements to the MESA framework's infrastructure that could help astronomers use computational resources more efficiently and flexibly. In addition, Nogueira et al. [15] examined the numerical convergence properties of a computational model that simulates two-dimensional solar convection using Implicit Large-Eddy Simulation (ILES) methods. They studied how closely the numerical solution approaches a stable, accurate answer as the computational settings are refined, analyzing the effect of grid resolution, time-step size, and other numerical parameters on convergence. Their results bear on how well and how reliably ILES methods can model the complicated dynamics of solar convection, and they provide information about the numerical stability and accuracy of these simulations.
In a similar vein, Bekki, Cameron, and Gizon [16] investigated the theory of solar oscillations in the inertial frequency range. Using a nonlinear rotating convection simulation, they studied how these waves propagate and evolve over time, paying particular attention to the amplitudes of equatorial modes. To illuminate how the Sun's interior works, they examined how rotation and flow shape the Sun's oscillations. Iijima et al. [17] likewise investigated the spatial scales and temporal variations of deep convection inside the Sun. Their research examined the complicated dynamics of turbulent processes below the solar surface, seeking to understand how these processes evolve over time and how they behave on different scales. The study used both computer models and observational data to show how complex subsurface flow is, deepening our understanding of the Sun's interior and of how it affects the Sun's energy transport and activity. Table 1 summarizes the differences between these works.
TABLE 1 Related works.
| Author | Main idea | Advantages | Disadvantages | Simulation environment | Dataset |
| --- | --- | --- | --- | --- | --- |
| Yang et al. [13] | Enhancing the efficiency of numerical computations through the optimization of multi-grid methods and parallelization techniques | — | — | — | — |
| Jermyn et al. [14] | Developing and enhancing the MESA framework | — | — | — | — |
| Nogueira et al. [15] | Exploring the numerical convergence characteristics of a computational model simulating two-dimensional solar convection | — | — | — | — |
| Bekki, Cameron, Gizon [16] | Delving into the theoretical aspects of solar oscillations within the inertial frequency range | — | — | — | — |
| Iijima et al. [17] | Exploring the spatial scales and temporal variations of subsurface convection within the solar interior | — | — | — | — |
| Our work | Enhancing computing power through parallelization in astrophysical numerical calculations | — | — | — | — |
Reviewing earlier work on parallel solutions in astronomical computing, especially work that uses GPUs and multi-core processors to study solar convection, makes it possible to improve on existing methods. A broad reading of prior work lets researchers avoid duplicating effort and clarifies how parallel computing is evolving, making it easier to find faster and better ways to study turbulent flow in the Sun. Astrophysics needs ever more computing power because its calculations are long-running, especially in the analysis of solar convection, and it is difficult for astrophysicists to examine new data quickly. Parallel methods accelerate these calculations by letting different subsets of the data be processed independently. The works cited above mostly use multi-core CPUs and GPUs to parallelize numerical calculations of solar physics. The goal is to reduce data-processing time substantially so that new data can be analyzed sooner and without long runtimes. Parallel approaches give very good results, especially for 3D workloads on GPUs, which accelerate the work considerably; this shows how important it is to match the parallelization method to the size of the calculation. Multi-core processors, on the other hand, work better for 2D calculations. These studies help address the difficulties that arise in modeling solar convection, and they show how performance varies across different hardware and processing methods.
System Model and Proposed Method
A multi-core processor is a single piece of hardware containing two or more central processing units, or “cores,” that work together. This design lets several processing units operate within a single package, each core executing its own tasks [18]. Two types of cache, local and shared, help these cores run faster, and the cores can communicate through either message passing or shared memory [19]. Intel PCs with a shared L3 cache were used for this work, and the OpenMP application programming interface was used so that programs could run on multi-core CPUs [20]. These hardware and software choices are intended to make the most of the parallel processing inherent in multi-core designs, yielding a faster and more capable system [21].
OpenMP is a flexible and useful tool that extracts most of the parallel capability of machines with multiple processors or cores through the well-known fork-join model of parallel execution [22]. This model applies to many kinds of computational problems, but it shines on problems involving large arrays, and OpenMP switches cleanly between parallel and sequential execution, which makes it suitable for a variety of computational tasks. OpenMP achieves parallelism using threads [23]. An OpenMP program starts with a single thread, called the master thread, running serially until it reaches the first parallel construct. The parallel directive of the OpenMP C/C++ API defines the start of a parallel region; when the master thread encounters this construct, it assembles a team of threads with the master as leader, and each thread in the team executes the statements within the parallel region's dynamic extent. Multiple tasks can thus run at once, and the machine performs better as a result [24]. OpenMP uses the for work-sharing construct to run loops in parallel; note that OpenMP version 2.0 can only parallelize the outermost loop. Combining parallelism and loop processing in this way extracts the most from multi-core machines and speeds up execution overall in both serial and parallel regions [25].
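As a minimal sketch of the fork-join pattern described above (an illustration, not the paper's actual code), the following C function parallelizes an element-wise loop with an OpenMP `parallel for` directive. The function name and operation are chosen only for demonstration; the pragma is ignored by compilers without OpenMP support, so the result is identical either way.

```c
#include <stddef.h>

/* Fork-join sketch: at the pragma the master thread forks a team,
 * each thread takes a share of the iterations, and the team joins
 * when the loop ends.  Without OpenMP the pragma is ignored and the
 * loop simply runs serially with the same result. */
void scale_add(const double *x, double *y, double a, long n) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];   /* iterations are independent */
    }
}
```

Because each iteration touches only its own array element, no synchronization beyond the implicit join is needed, which is exactly the property that makes such loops profitable to parallelize.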
NVIDIA's release of the Tesla unified graphics and computing architecture in November 2006 was a milestone for Graphics Processing Units (GPUs). The new architecture gave GPUs processing power they had never had before: they could do more than render images. At the heart of this change is the Compute Unified Device Architecture (CUDA) platform, which makes GPU hardware straightforward to program; Lindholm et al. described this development in 2008. As GPUs have become more flexible, they can do far more than process graphics, which has made parallel processing much more useful in many fields. Within the Tesla GPU computing model, parallelism is available at three distinct levels of granularity. CUDA manages the launching of parallel kernels, letting the GPU run many parallel threads at the same time. Thread-level parallelism not only makes the best use of GPU resources but also makes concurrency easier to express, which speeds up work overall. The Cooperative Thread Array (CTA), also called a “thread block,” is used in the Tesla architecture to manage many threads running at once; CTAs help threads cooperate and share information across levels of parallelism, which makes execution smoother and lets the system scale to different kinds of workloads. Tesla's scheduling model groups CTAs into larger units known as grids. This well-organized hierarchy makes it possible to manage demanding computations at a high level while keeping them tractable and scalable, and it is one of the main reasons the architecture is fast and accurate across many kinds of computational work. The Tesla unified graphics and computing architecture is thus a major step forward, making GPUs more adaptable and useful for a wider range of computing tasks.
Combining CUDA code with this hierarchical organization of parallelism yields a robust structure that opens a new era of GPU design reaching beyond simple graphics processing. This change shows that GPU technology keeps improving, enabling major advances in parallel computing [26].
CUDA makes it easy to embed parallel kernels, whether simple functions or whole programs, in a main program. The compiler takes programs containing both serial CPU code and parallel GPU kernel code and translates the latter into code that runs on the GPU. Because this translation is done by the compiler, programmers can treat the GPU as a separate engine with its own memory system, which opens new options for high-performance computing. The CUDA computing model organizes parallelism in layers, with thread blocks and grids as the main units. The Cooperative Thread Array (CTA), also called a thread block, manages many threads at once, and CTAs are further grouped into grids. This tiered structure makes parallel threads easier to launch and manage on the GPU, making CUDA a practical tool for exploiting parallel processing in general. A notable property of the CUDA model is its focus on lightweight threads: there should be a great deal of work to do, with each task requiring little computation. This property suits GPUs well, since it lets many light jobs run quickly and efficiently at the same time, and it is what lets CUDA's full range of scientific and computational uses shine. In short, CUDA is a programming model that has turned GPUs into configurable machines for parallel computing. CUDA is one of the leading platforms in high-performance parallel computing; its central ideas are lightweight thread processing, a tiered ordering of parallel threads, and the integration of parallel kernels into ordinary programs. This paradigm shift has enabled scientific studies, models, and computational jobs requiring previously unattainable processing power [27].
Although the CUDA computing model works very well, it has some limitations that reduce its general applicability. One restriction is that threads and thread blocks can only be created by launching a parallel kernel; new threads cannot be created while a kernel is running, which limits dynamic thread generation. Furthermore, the CUDA kernels considered here do not allow recursive functions, so recursion cannot be used in kernel logic. These limits follow from CUDA's design choices, and situations that need dynamic thread creation or recursion inside a kernel therefore require careful thought and alternative approaches. Understanding these limits is important for developers working with CUDA because it leads to more deliberate design choices and workarounds suited to specific computing needs [28].
Parallel Algorithms Application in Astronomy
The study of astrophysics has changed a great deal since parallel computers, especially multi-core CPUs and GPUs, became widespread, because many astronomy problems demand fast computation. One useful property of these architectures is that they can process many things at once, which accelerates complicated studies such as ray-shooting models of gravitational lensing. Scientists increasingly use multi-core computers and GPUs to solve N-body problems, particularly those in which the gravitational pulls of several celestial bodies interact; this kind of computing is very helpful for modeling and understanding complicated dynamical systems. Kepler's equation is part of orbital mechanics, and these parallel architectures speed up the processes that solve it [29]. In radio astronomy, multi-core CPUs and GPUs are increasingly used to ease the analysis of data from space; radio observatories receive enormous volumes of data, and this progress is essential for extracting useful information from it. These architectures can also be used to solve the magnetohydrodynamics (MHD) equations, which describe how complex plasmas in space move. Their parallel processing power makes it straightforward to compute the temperature of dust in galaxies, a key ingredient in understanding how stars form, and it also eases the mathematics behind the search for gravitational waves, an important goal in science. These architectures are likewise useful for building models of dark matter, helping scientists continue to probe the universe's building blocks. Parallel computing is becoming ever more important in astronomy as computational problems grow harder; thanks to these architectures, researchers can now investigate the mysteries of the universe faster and in more depth than ever before [30].
Task Allocation Between CPU and GPU
Our hybrid parallel method works best when jobs are divided intelligently between the CPU and GPU, according to the nature and difficulty of the work. For highly parallel work on large data, such as 3D models and demanding matrix operations, the GPU is the best choice. These jobs benefit from the GPU's massive parallelism, which lets thousands of cores work on separate pieces of data at once. For instance, the GPU is very good at computing gradients, divergences, and Laplacians in 3D models, because these tasks can be split up and executed simultaneously across many elements.
The CPU is better for jobs that cannot be parallelized or must be done in a particular order, such as control logic, task organization, and post-processing. Smaller jobs that need fewer resources, such as 2D models, are handled by the CPU so that transferring data to and from the GPU does not consume too much time. OpenMP spreads these smaller jobs across multiple CPU cores so that resources are used efficiently and data transfers to and from the GPU do not add to the waiting time.
When possible, our system runs CPU and GPU jobs at the same time to save even more time: the GPU works on the more difficult jobs while the CPU performs the simpler ones. A scheduling mechanism ensures that these processes run concurrently, which cuts wasted time and makes the best use of resources. When work on the GPU finishes, the results are sent back to the CPU for post-processing. When dividing tasks this way, it is important to minimize the overhead of transferring data between the CPU and GPU; we assign the GPU only jobs that involve large amounts of data, which reduces the number of these transfers and the time they take. For a 2D program, the CPU's many cores can do the work well, so the GPU need not be used at all, saving transfer time. By sharing tasks dynamically, we use the best features of both the CPU and the GPU, which speeds up calculations while preserving their accuracy.
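The dispatch rule described above can be sketched as a simple C heuristic. This is illustrative only: the threshold and the function are assumptions for demonstration, not the paper's tuned values, but they capture the stated policy of sending large 3D data-parallel work to the GPU and keeping 2D or small work on the CPU to avoid transfer overhead.

```c
/* Illustrative dispatch heuristic (the ~1M-cell cutoff is an
 * assumed value, not taken from the paper). */
typedef enum { RUN_ON_CPU, RUN_ON_GPU } device_t;

device_t choose_device(int dims, long n_cells) {
    const long gpu_threshold = 1L << 20;  /* assumed cutoff: ~1M cells */
    if (dims >= 3 && n_cells >= gpu_threshold)
        return RUN_ON_GPU;  /* large 3D stencils amortize transfer cost */
    return RUN_ON_CPU;      /* 2D or small grids stay on the multi-core CPU */
}
```

In a real scheduler the threshold would be calibrated by measuring transfer time against kernel time on the target hardware.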
Convection and Its Calculation
Convection occurs when plasma moves through the interior of a star: hot plasma rises and cooler plasma sinks. This process is important for understanding stellar dynamics, especially the behavior of convection currents. Solar granulation is caused by the flow of gas columns in the convection zone and reveals the convection processes going on below the Sun's photosphere: the rising columns appear brighter and hotter, producing the familiar granulation pattern on the solar surface. This study uses parallelism and two different computational approaches to examine the numerical side of describing these phenomena. The “reduced speed of sound” technique modifies the effective sound speed, making simulations of convection more tractable, while numerical simulation theory provides the framework for modeling and understanding stellar convection. The equations used in the computational models express the main quantities and relationships that govern convection in stars. By examining these computational tools, the study aims to deepen our understanding of how convective processes work, improving our knowledge of phenomena such as solar granulation and contributing to astrophysics research as a whole.
where P, V, S, ρ, and T denote pressure, velocity, entropy, density, and temperature, respectively. The implemented equations are given by Equations (6)–(10):
Implementation
Convection calculations are conducted in two-dimensional spaces, represented in Figure 1, and the computational domain extends into a three-dimensional space, segmented as illustrated in Figure 2. This arrangement makes it easier to examine convection events in depth, revealing more about how the system behaves and evolves over time. Using both two-dimensional and three-dimensional domains improves the accuracy and depth of the computational models, giving researchers a solid basis for studying fluid processes in a wide range of situations.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Because the computational space is discretized into cells of length h, the gradients, divergences, and Laplacians must be found for each cell. This discretization allows these derivatives to be computed in a planned, synchronized way, since the partial derivatives of one cell do not depend on the results for other cells. A robust scheme based on the five-point derivative method is used to find the Laplacian of a scalar variable, the divergence of a vector variable, and the gradient of a scalar variable; Equations (11) through (15) give the steps of this method. This organized approach ensures that the required derivatives are computed quickly and correctly in every cell, which adds to the accuracy and dependability of the computational framework as a whole.
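One common instance of a five-point scheme is the standard five-point stencil for the 2D Laplacian; the sketch below assumes that form (the paper's exact variant is given by its Equations (11)–(15), which are not reproduced here). Each interior point depends only on its four neighbours, which is precisely the independence property that makes the per-cell loop parallelizable.

```c
/* Standard five-point stencil for the 2D Laplacian on a uniform grid
 * with spacing h, stored row-major with row length nx.  This is one
 * common form of a five-point method, shown for illustration. */
double laplacian5(const double *u, int nx, int i, int j, double h) {
    double c = u[j * nx + i];
    return (u[j * nx + (i + 1)] + u[j * nx + (i - 1)] +
            u[(j + 1) * nx + i] + u[(j - 1) * nx + i] - 4.0 * c) / (h * h);
}
```

A quick sanity check: for u(x, y) = x² + y² the exact Laplacian is 4 everywhere, and this stencil reproduces it exactly because the field is quadratic.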
Figure 3 illustrates one application of the five-point derivative method. This method is an important part of the computational approach because it provides a thorough, organized way to find gradients, divergences, and Laplacians in every cell of the discretized space. The five-point derivative method is a robust numerical technique widely used in scientific computing because it estimates derivatives accurately. Figure 3 underlines how central the method is to keeping the convection calculations correct: it makes it possible to compute precisely the key quantities needed to understand fluid dynamics and flow in the simulated domain, improving the computational model overall.
[IMAGE OMITTED. SEE PDF]
The temporal advancement of the numerical solution is achieved with an explicit fourth-order Runge–Kutta scheme, employed for its proven accuracy and stability in approximating the time evolution of dynamic systems. The method is characterized by the set of iterative equations given in relations (16) through (20), which encapsulate the stepwise computations involved in advancing the solution through discrete time intervals. Specifically, the method computes weighted averages of function evaluations at multiple points within each time step, which enhances the accuracy with which temporal dynamics are captured. By employing the explicit fourth-order Runge–Kutta scheme, this study ensures a robust and precise temporal integration of the numerical solution, offering a reliable framework for simulating and analyzing the evolving behavior of the dynamic system under consideration.
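For a scalar problem, a single step of the classical explicit fourth-order Runge–Kutta scheme can be written as below. This is the textbook form of the method (four slope evaluations combined with weights 1/6, 1/3, 1/3, 1/6); the paper's relations (16)–(20) apply the same idea to the full system of field equations.

```c
/* One classical RK4 step for dy/dt = f(t, y). */
double rk4_step(double (*f)(double, double), double t, double y, double dt) {
    double k1 = f(t, y);
    double k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1);
    double k3 = f(t + 0.5 * dt, y + 0.5 * dt * k2);
    double k4 = f(t + dt, y + dt * k3);
    return y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0;
}

/* Simple test problem: dy/dt = y, whose exact solution is exp(t). */
double f_exp(double t, double y) { (void)t; return y; }
```

With dt = 0.1 the one-step result agrees with exp(0.1) to roughly 1e-7, consistent with the method's fifth-order local truncation error.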
Also, the intermediate quantities are obtained from the following equations:
Subsequently, the temporal evolution is characterized by encompassing discrete temporal increments. In the context of three-dimensional calculations, a tridimensional matrix is employed, representing distinct perspectives along the axes X, Y, and Z. Conversely, for two-dimensional computations, a bi-dimensional matrix is utilized, accounting for views along the X and Z axes. Each partial or time derivative is encapsulated within its respective matrix, delineating a structured representation of the evolving system across the chosen dimensions.
The time step of the explicit fourth-order Runge–Kutta method is also adjusted dynamically in our simulation based on the Courant–Friedrichs–Lewy (CFL) condition, to keep the scheme numerically stable. The CFL number is chosen so that the time step remains small enough to prevent numerical errors from growing. The CFL condition is evaluated with Equation (21).
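A common form of a CFL-limited time step is sketched below. This is an assumption for illustration: the standard condition dt ≤ C · h / (|v|max + c_s) with CFL number C < 1 is used here, since the exact inequality of the paper's Equation (21) is not reproduced in the text.

```c
/* Illustrative CFL-limited time step: the grid spacing h divided by
 * the fastest signal speed (advection plus sound), scaled by the CFL
 * number cfl < 1.  This is a standard form, assumed here. */
double cfl_dt(double h, double v_max, double c_sound, double cfl) {
    return cfl * h / (v_max + c_sound);
}
```

Recomputing this bound every step lets the simulation take the largest stable step as the flow speeds change.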
Sequential Implementation
In the sequential implementation, the partial derivatives within a three-dimensional structure are computed with three nested loops, one for each of the X, Y, and Z dimensions; for two-dimensional structures, a pair of nested loops over the X and Z dimensions is used. Within this framework, the partial derivatives of each subsection are computed one after another, ensuring a comprehensive evaluation of the system's evolving behavior across the specified dimensions.
Parallel Implementation with OpenMP
In this implementation, the computations of gradients, divergences, and Laplacians are encoded as parallel loops, optimizing the efficiency of the numerical procedures. The outermost loop is distributed across parallel threads, harnessing the concurrent processing capabilities of modern computing architectures. Optimal results are achieved when the number of threads matches the number of available processor cores, ensuring an even distribution of computational tasks across the parallelized framework. This strategy exploits parallelization for enhanced performance, with particular emphasis on matching the concurrency of threads to the underlying hardware architecture.
Parallel Implementation with CUDA
When GPUs are used to accelerate the computations, CUDA kernels carry out the demanding operations such as gradients, Laplacians, and divergences. Using the GPU's inherently parallel design and the CUDA programming framework, each element of the arrays involved in these calculations is processed at the same time as its counterparts. A key design choice is the use of 3D thread blocks, which give each array element its own thread for processing; this keeps threads aligned as the calculations for each array proceed and makes the most of the GPU's ability to do many things at once. What most distinguishes this approach is that the array arithmetic uses no conventional loops: instead of loops, each element of the array is assigned to a different thread. This fits the GPU processing model, in which many threads work on small jobs simultaneously, so the computational work is spread across a large number of active threads, exploiting the parallelism built into GPU designs. When the kernel finishes, control returns to the host code, and the computed values are copied from the GPU's memory to main memory, making it easy to feed the results back into the overall computation. It is worth noting that GPU threads are designed to work well with light computational loads, whereas multi-core CPUs are good at running resource-heavy threads; this contrast underlines the different ways threads are used, and the emphasis on speed and parallelism, in GPU-accelerated computing.
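The per-element thread assignment above rests on simple index arithmetic: each thread derives its global coordinate from its block and thread indices, then maps that coordinate to an array offset. The sketch below reproduces this arithmetic in plain C so it can be checked without a GPU; the names mirror CUDA's built-in variables (`blockIdx`, `blockDim`, `threadIdx`) but are ordinary values here, and the row-major layout is an assumption about how the grid arrays are stored.

```c
/* Plain-C stand-in for a 3D coordinate (mirrors CUDA's dim3). */
typedef struct { int x, y, z; } dim3c;

/* Global index along one axis, exactly as a CUDA kernel computes it:
 * blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

/* Flatten a 3D global coordinate into a row-major offset for an
 * nx * ny * nz array (assumed layout for the simulation grids). */
int flatten3d(dim3c g, int nx, int ny) {
    return (g.z * ny + g.y) * nx + g.x;
}
```

Inside an actual kernel, each thread would compute its offset this way, guard against indices beyond the array bounds, and then update only its own element, which is why no explicit loop is needed.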
Shared memory is very important when using GPUs to accelerate computation because it lets threads in the same block communicate quickly, cutting the time spent reaching global memory. Our CUDA code uses shared memory to store frequently accessed data, such as intermediate results of the gradient, divergence, and Laplacian computations. Using shared memory instead of repeated global memory accesses yields faster results, especially in 3D calculations where data locality matters greatly. In our approach, each thread block stages small tiles of data in shared memory, so the slower global memory is accessed less often. This optimization is especially helpful for large datasets, because it lets more work proceed concurrently and makes better use of GPU resources. The observed speedups owe much to shared memory, particularly in comparison with a naive version that uses only global memory.
Combining OpenMP and CUDA
Our hybrid parallel method combines OpenMP and CUDA to accelerate the solar convection simulations. OpenMP provides multi-threading on the CPU and handles tasks better suited to multi-core processors, such as 2D calculations and work with low parallelism overhead, which are split across several CPU cores and executed in parallel. CUDA, on the GPU, accelerates the more demanding, data-heavy tasks such as the 3D simulations. The CPU (via OpenMP) takes the lighter workloads for which GPU offload would add overhead, while the GPU (via CUDA) takes the data-parallel workloads, such as large matrix operations and 3D flow models, where all of its cores can be kept busy. We ensured the two modes cooperate: OpenMP tasks run on the CPU concurrently with GPU kernels, and when the GPU work finishes, its results are transferred back to the CPU for further processing. By using OpenMP for CPU parallelism and CUDA for GPU acceleration, this hybrid model makes the best use of both sets of resources and delivers a substantial speedup.
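The routing logic of such a hybrid scheme might look like the sketch below. The size threshold, the function names, and the stubbed GPU path are all assumptions for illustration: small or 2D workloads stay on the multi-core CPU via an OpenMP loop, while large 3D workloads would be handed to a CUDA kernel (stubbed here with equivalent arithmetic so the dispatch logic runs anywhere).

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

enum class Backend { CpuOpenMP, GpuCuda };

// Hypothetical routing rule: only 3D fields above a size threshold go to the
// GPU, where the kernel launch and transfer overhead can be amortized.
Backend choose_backend(int dims, std::size_t n_elements) {
    const std::size_t gpu_threshold = 1u << 20;   // ~1M elements, illustrative
    return (dims == 3 && n_elements >= gpu_threshold) ? Backend::GpuCuda
                                                      : Backend::CpuOpenMP;
}

double sum_field(const std::vector<double>& f, Backend b) {
    double s = 0.0;
    if (b == Backend::CpuOpenMP) {
        // Multi-core CPU path; without OpenMP enabled the pragma is ignored
        // and the loop simply runs serially with the same result.
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < long(f.size()); ++i) s += f[std::size_t(i)];
    } else {
        // GPU path: in the real code this would launch a CUDA reduction
        // kernel and copy the result back; stubbed serially here.
        for (double v : f) s += v;
    }
    return s;
}
```

Either path produces the same result; only the routing decides which resource does the work.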
Results
To evaluate how well our parallel methods perform on different hardware, this study used two configurations. The first is a desktop pairing an octa-core Intel Core i9 processor with an NVIDIA GeForce RTX 3080 GPU and 32 GB of RAM. The second is a laptop pairing a multi-core Intel Core i9 processor with an NVIDIA GeForce RTX 3070 GPU and 32 GB of RAM. The two GPUs, the GeForce RTX 3080 in the desktop and the GeForce RTX 3070 in the laptop, were tested one at a time rather than together, to see how well each handles the processing demands of solar convection research. This lets us compare the GPU architectures and understand how each contributes to reducing the runtime of both 3D and 2D calculations. Because the two setups differ, direct comparisons of their performance must be made with care. Table 2 lists the computational scenarios and their sizes, spanning both two-dimensional and three-dimensional cases; the scenarios are further distinguished by numerical method, namely the simulation method and the reduced speed of sound method. Table 3 reports the corresponding timings on the first configuration.
Each scenario, labeled S0 through S7, is characterized by its spatial dimensions (x and z) and its numerical method, which helps situate its computational cost. An interesting pattern emerges in the results: using all four processor cores for both the three-dimensional and two-dimensional runs roughly doubles the processing speed relative to serial execution, which again shows how well these parallelization methods suit multi-core CPUs. A closer look at the three-dimensional calculations on the GPU reveals a speedup of about 58 times over serial execution. This large improvement, due to the parallelism built into GPU architectures, confirms the value of GPU-accelerated computing for demanding numerical tasks. The implications extend beyond this experiment: in line with the goal of faster computation, researchers and practitioners are encouraged to match workloads to the system resources that serve them best, particularly in numerical calculation and scientific modeling.
TABLE 2 Computational scenarios and dimensions.
| Scenario | Method | Dimensions (X) | Dimensions (Z) |
| S0 | Reduced Speed of Sound | ||
| S1 | Reduced Speed of Sound | ||
| S2 | Reduced Speed of Sound | ||
| S3 | Reduced Speed of Sound | ||
| S4 | Simulation Method | ||
| S5 | Simulation Method | ||
| S6 | Simulation Method | ||
| S7 | Simulation Method |
TABLE 3 Time taken to perform calculations on the first configuration.
| Process | 1 CPU | 8 CPU | GPU |
| S0 | 5665 | 1995 | 430 |
| S1 | 2635 | 928 | 207 |
| S2 | 23 | 7 | 42 |
| S3 | 68 | 20 | 125 |
| S4 | 1129 | 435 | 100 |
| S5 | 2508 | 895 | 206 |
| S6 | 9 | 3 | 17 |
| S7 | 30 | 10 | 52 |
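The speedups discussed in the text follow directly from these timings as the ratio of serial to parallel time. As a minimal check, the snippet below applies that ratio to the S0 row of Table 3 (1 CPU: 5665, 8 CPU: 1995, GPU: 430), reproducing the roughly threefold multi-core gain and the order-of-magnitude GPU gain.

```cpp
#include <cassert>
#include <cmath>

// Speedup of a parallel run relative to the serial baseline.
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}
```

For S0 this gives 5665 / 1995 ≈ 2.84 on eight cores and 5665 / 430 ≈ 13.2 on the GPU.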
Because two-dimensional calculations are comparatively small, the overhead of moving data between RAM and GPU memory dominates and slows them down. Figures 4 and 5 show the outcomes of these runs and make the cost of this extra data transfer visible.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
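A toy cost model captures why the small 2D cases lose on the GPU: the GPU only wins once its faster per-element compute outweighs the fixed host-device transfer cost. All constants below (per-element costs, transfer overhead) are made-up illustrative numbers, not measurements from this study.

```cpp
#include <cstddef>
#include <cassert>

// Returns true when the modeled GPU time (fixed transfer cost plus fast
// per-element compute) beats the modeled CPU time (slow per-element compute).
bool gpu_worthwhile(std::size_t n_elements) {
    const double cpu_per_elem = 10.0;    // ns per element on the CPU (assumed)
    const double gpu_per_elem = 0.5;     // ns per element on the GPU (assumed)
    const double transfer_ns  = 2.0e6;   // fixed host-device transfer cost (assumed)
    double t_cpu = cpu_per_elem * double(n_elements);
    double t_gpu = transfer_ns + gpu_per_elem * double(n_elements);
    return t_gpu < t_cpu;
}
```

Under these assumed constants, a 256x256 2D grid stays on the CPU while a 256^3 3D grid clearly favours the GPU, mirroring the trend in the tables.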
As indicated in Table 4, using all eight processor cores for both the 3D and 2D computations yields roughly a threefold increase in computational speed. This improvement comes from the parallelism of the multi-core configuration, which lets the system handle several tasks concurrently; allocating the work efficiently across the available cores raises throughput and shortens total processing time. This underscores the value of multi-core architectures for computational tasks, especially intricate 3D and 2D calculations where parallelization pays off significantly.
TABLE 4 Time taken to perform calculations on the second configuration.
| Process | 1 CPU | 4 CPU | GPU |
| S0 | 5442 | 2948 | 96.307 |
| S1 | 2520 | 1386 | 43.542 |
| S2 | 22.745 | 10.28 | 13.567 |
| S3 | 66.137 | 30.342 | 40.37 |
| S4 | 1125 | 653 | 22.495 |
| S5 | 2486 | 1377 | 44.257 |
| S6 | 10.46 | 4.54 | 5.419 |
| S7 | 30.141 | 18.805 | 15.905 |
For the 3D GPU implementation, computation speed increases by nearly 15 times. The picture changes for 2D computation: the comparatively small data size becomes a limiting factor, and the overhead of CPU-GPU interaction and data transfer reduces the effective computing speed. Figures 6 and 7 summarize the outcomes of these implementations and show the performance gap between the 3D and 2D cases, highlighting how computational efficiency depends on the dimensionality of the task.
[IMAGE OMITTED. SEE PDF]
[IMAGE OMITTED. SEE PDF]
Figure 8 shows how grid resolution affects the runtime of the 3D solar convection models under both CPU-based and GPU-based parallelization. The X-axis gives the resolution as the number of grid points and the Y-axis gives the processing time in seconds; the OpenMP (CPU) results are shown in red and the CUDA (GPU) results in blue. As expected, runtime grows with resolution. At smaller grids such as 64³ or 128³, the CPU and GPU perform about the same, with the CPU slightly slower; at these sizes, the repeated transfers between CPU and GPU memory offset much of the GPU's advantage. As resolution increases, however, the GPU pulls far ahead. Running the 3D solar convection models at higher resolutions such as 256³ and 512³ involves much more data, which the GPU handles far better, so it completes the arithmetic much faster than the CPU. CPU runtime grows steeply with grid size, showing that it scales poorly to larger problems, whereas the GPU scales well thanks to its greater concurrency: even at 512³ it finishes the run almost four times faster than the CPU. This demonstrates the value of CUDA-based parallelization for space- and time-dependent models that demand heavy computation, and shows that the benefits of parallelization become clearer as the workload grows, underlining the importance of exploiting GPU resources in large-scale simulations.
[IMAGE OMITTED. SEE PDF]
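The scaling behind Figure 8 follows from the cell count: doubling the resolution of a 3D grid multiplies the number of grid points, and hence the ideal work per step, by 2³ = 8. The snippet below is only this back-of-the-envelope model, not a fit to the measured curves.

```cpp
#include <cassert>

// Ratio of grid points (and thus ideal work) between two cubic resolutions.
double relative_work_3d(int n_from, int n_to) {
    double r = double(n_to) / double(n_from);
    return r * r * r;
}
```

So going from 256³ to 512³ multiplies the ideal work by 8, which is why the CPU curve steepens so sharply while the GPU, with far more concurrency to absorb the extra elements, degrades more gracefully.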
Figure 9 shows how runtime varies with energy flux for the solar convection models under both CPU-based and GPU-based parallelization. The X-axis gives the energy flux in watts per square meter (W/m²) and the Y-axis the computation time in seconds; the CPU (OpenMP) results are shown in red and the GPU (CUDA) results in blue. Runtime rises with energy flux for both devices, but at high flux (10,000 to 20,000 W/m²) the CPU's runtime climbs much more steeply, suggesting that it struggles to keep pace as the computational demand grows. The GPU, by contrast, absorbs the larger workloads with little slowdown thanks to its superior parallelism, and it remains stable even at the highest flux levels. For models driven by large energy fluxes this matters: the CPU copes with smaller fluxes but degrades as they rise, whereas GPU acceleration keeps the most demanding simulations tractable.
[IMAGE OMITTED. SEE PDF]
Figure 10 shows how the runtime of the solar convection models, under both CPU-based and GPU-based parallelization, depends on the CFL condition and hence on the time step size. The X-axis gives the CFL number, where larger values mean larger time steps, and the Y-axis the runtime in seconds; the red line is the CPU (OpenMP) and the blue line the GPU (CUDA). For CFL numbers between 0.2 and 0.5, both devices perform well and runtimes stay short. As the CFL number rises toward 0.8 and 1.0, the CPU's runtime increases sharply, suggesting that it struggles with the heavier per-step workload of larger time steps, most likely because that load cannot be divided as effectively across its cores. At the largest value (CFL = 1.0) the CPU may even show numerical instability or slower convergence, indicating that it cannot maintain stability in those conditions. The GPU scales better: although its runtime also grows with larger time steps, it consistently outperforms the CPU, especially at CFL = 0.8 and 1.0, remaining stable and converging well. Overall, Figure 10 shows that both devices handle small time steps comfortably, but the GPU is markedly faster and more reliable at large ones, which matters for complex models that demand aggressive time steps and CFL numbers.
[IMAGE OMITTED. SEE PDF]
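For an explicit scheme, the CFL condition ties the time step to the grid spacing and the fastest signal speed as dt = CFL · dx / c_max, which is why the CFL values on the X-axis of Figure 10 map directly onto step size. The numbers below are illustrative, not taken from the simulations.

```cpp
#include <cassert>
#include <cmath>

// Time step allowed by the CFL condition for an explicit scheme:
// dt = cfl * dx / c_max, with c_max the fastest signal speed on the grid.
double cfl_time_step(double cfl, double dx, double c_max) {
    return cfl * dx / c_max;
}
```

With an (assumed) grid spacing of 1000 m and a maximum signal speed of 10,000 m/s, CFL = 0.5 permits dt = 0.05 s, and raising the CFL number raises dt proportionally.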
Figure 11 shows how the runtime of the 3D solar convection models changes as the number of parallel threads (for the CPU) or blocks (for the GPU) increases. The X-axis gives the thread or block count and the Y-axis the runtime in seconds; the red line is the CPU (OpenMP) and the blue line the GPU (CUDA). As expected, increasing parallelism speeds up both devices. On the GPU, runtime drops sharply from two blocks to eight, showing how effectively CUDA spreads the work across blocks; beyond 16 blocks, however, the gains taper off into diminishing returns, indicating that the GPU's capacity is saturated and extra blocks add little. The CPU behaves similarly: runtime falls noticeably between 2 and 8 threads, but at 16 threads it hits an architectural limit past which additional threads barely help. The lesson of Figure 11 is that adding parallel threads or blocks improves performance substantially up to a point, after which both devices show diminishing returns, so the thread or block count must be tuned to the hardware's capabilities.
[IMAGE OMITTED. SEE PDF]
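The diminishing returns just described are what Amdahl's law predicts: with parallel fraction p, the speedup on n workers is 1 / ((1 − p) + p / n), which saturates at 1 / (1 − p) no matter how many threads or blocks are added. The value p = 0.95 below is illustrative, not measured from this code.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: speedup on n workers for a program whose parallelizable
// fraction is p (0 <= p <= 1). The serial fraction (1 - p) caps the gain.
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / double(n));
}
```

With p = 0.95, eight workers give a speedup of about 5.9, while even a million workers cannot exceed the ceiling of 1 / 0.05 = 20, which is the saturation pattern seen in Figure 11.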
Conclusion and Future Work
Most practitioners already know that multi-core CPUs and GPUs can accelerate numerical models; our work differs in several important respects. Our parallelization method is designed specifically for both 2D and 3D models, and we explain in detail how different workloads (2D versus 3D solar convection simulations) benefit from different approaches: multi-core CPUs are the better fit for 2D calculations, while GPUs are much faster for 3D calculations, so each task should be assigned to the hardware that suits it best. Going further, we propose a hybrid model that combines OpenMP for CPU parallelism with CUDA for GPU acceleration, ensuring that tasks are routed appropriately and that both the CPU and the GPU stay busy; because the two run concurrently, this approach outperforms methods that use either resource alone. Our detailed evaluations on GeForce RTX 3070 and RTX 3080 hardware show how different configurations affect the performance of astrophysical models, and together the mixed hardware and tailored parallelization techniques improve the speed of these simulations. In conclusion, our investigation into computational methods for convection calculations has yielded valuable insights. GPUs deliver a remarkable increase in the speed of three-dimensional computations, enhancing their efficiency significantly. In 2D computation, however, the overhead of CPU-GPU interaction and data transfer erodes the benefit of GPU deployment, while multi-core processors remain a promising alternative, offering a threefold increase in computing speed.
Our focus on convection calculations in Cartesian coordinates, excluding the magnetic field calculation, has provided a foundational understanding.
Our current approach exploits the GPU's parallel computing features while operating mostly in global memory. In future work we intend to add shared memory to the GPU thread blocks, reducing memory access latency and improving data locality; this matters most for compute-heavy workloads such as the 3D models, and we expect it to improve performance substantially. The next phase of our study will examine how this change affects results and resource usage. Future work should also explore deep learning for these tasks. Deep learning models excel at finding complex patterns and relationships in data, so integrating deep neural networks into convection predictions could improve both their accuracy and our understanding of the underlying flow. Investigating GPU acceleration within deep learning frameworks could likewise yield unprecedented gains in computation speed and efficiency. That said, challenges around data access, model interpretability, and fine-tuning for specific flow scenarios must be addressed before such work can produce robust and correct results.
Future research should study these deep learning-enhanced computing methods in greater depth to ensure they are robust and reliable. Integrating deep learning into convection calculations is not only a significant technical advance but also a step toward better accuracy and predictability in the study of complex fluid dynamics.
Conflicts of Interest
The authors declare no conflicts of interest.
Data Availability Statement
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
S. Kawtikwar and R. Nagi, “HyLAC: Hybrid Linear Assignment Solver in CUDA,” Journal of Parallel and Distributed Computing 187 (2024): [eLocator: 104838].
W. Yang, J. Fang, D. Dong, X. Su, and Z. Wang, “Optimizing Full‐Spectrum Matrix Multiplications on ARMv8 Multi‐Core CPUs,” IEEE Transactions on Parallel and Distributed Systems 35 (2024): 439–454.
L. Tong, H. Zhou, and B. Sheil, “Multicore CPU‐Based Parallel Computing Accelerated Digital Image Correlation for Large Soil Deformations Measurement,” Computers and Geotechnics 166 (2024): [eLocator: 106027].
Z. Amiri, A. Heidari, M. Darbandi, et al., “The Personal Health Applications of Machine Learning Techniques in the Internet of Behaviors,” Sustainability 15, no. 16 (2023): [eLocator: 12406].
M. Saeed Sharifian, V. Rashtchi, and A. Azarpeyvand, “Parallel Chaos‐Based Image Encryption Algorithm: High‐Level Synthesis and FPGA Implementation,” Journal of Supercomputing 80 (2024): 10985–11013.
E. Alhenawi, R. Khurma, R. Damaševičius, and A. Hussien, “Solving Traveling Salesman Problem Using Parallel River Formation Dynamics Optimization Algorithm on Multi‐Core Architecture Using Apache Spark,” International Journal of Computational Intelligence Systems 17, no. 1 (2024): 1–14.
Ö. Dülger and T. Dökeroğlu, “A New Parallel Tabu Search Algorithm for the Optimization of the Maximum Vertex Weight Clique Problem,” Concurrency and Computation: Practice and Experience 36, no. 2 (2024): [eLocator: e7891].
A. Munera, S. Royuela, M. Pressler, H. Mackamul, D. Ziegenbein, and E. Quiñones, “Fine‐Grained Adaptive Parallelism for Automotive Systems Through AMALTHEA and OpenMP,” Journal of Systems Architecture 146 (2024): [eLocator: 103034].
F. Gregoretti, G. Pezzulo, and D. Maisto, “CPP‐AIF: A Multi‐Core C++ Implementation of Active Inference for Partially Observable Markov Decision Processes,” Neurocomputing 568 (2024): [eLocator: 127065].
J. M. Rodríguez‐Borbón, X. Wang, A. P. Diéguez, K. Z. Ibrahim, and B. M. Wong, “TRAVOLTA: GPU Acceleration and Algorithmic Improvements for Constructing Quantum Optimal Control Fields in Photo‐Excited Systems,” Computer Physics Communications 296 (2024): [eLocator: 109017].
A. Sabu, C. Liu, and T. E. Carlson, Viper: Utilizing Hierarchical Program Structure to Accelerate Multi‐Core Simulation (Piscataway, NJ: IEEE Access, 2024).
Z. Amiri, A. Heidari, N. J. Navimipour, M. Esmaeilpour, and Y. Yazdani, “The Deep Learning Applications in IoT‐Based Bio‐and Medical Informatics: A Systematic Literature Review,” Neural Computing and Applications 36 (2024): 5757–5797.
X. Yang, S. Li, F. Yuan, D. Dong, C. Huang, and Z. Wang, “Optimizing Multi‐Grid Computation and Parallelization on Multi‐Cores,” in Proceedings of the 37th International Conference on Supercomputing (2023).
A. S. Jermyn, E. B. Bauer, J. Schwab, et al., “Modules for Experiments in Stellar Astrophysics (MESA): Time‐Dependent Convection, Energy Conservation, Automatic Differentiation, and Infrastructure,” Astrophysical Journal Supplement Series 265, no. 1 (2023): 15.
H. Nogueira, G. Guerrero, P. Smolarkiewicz, and A. Kosovichev, “Numerical Convergence of 2D Solar Convection in Implicit Large‐Eddy Simulations,” Astrophysical Journal 928, no. 2 (2022): 148.
Y. Bekki, R. H. Cameron, and L. Gizon, “Theory of Solar Oscillations in the Inertial Frequency Range: Amplitudes of Equatorial Modes From a Nonlinear Rotating Convection Simulation,” Astronomy and Astrophysics 666 (2022): A135.
H. Iijima, T. Matsumoto, H. Hotta, and S. Imada, “A Comprehensive Simulation of Solar Wind Formation From the Solar Interior: Significant Cross‐Field Energy Transport by Interchange Reconnection Near the Sun,” Astrophysical Journal Letters 951, no. 2 (2023): L47.
D. Herrero‐Pérez and S. Picó‐Vicente, “Adaptive Fail‐Safe Topology Optimization Using a Hierarchical Parallelization Scheme,” Computers and Structures 291 (2024): [eLocator: 107205].
Z. Amiri, A. Heidari, N. J. Navimipour, M. Unal, and A. Mousavi, “Adventures in Data Analysis: A Systematic Review of Deep Learning Techniques for Pattern Recognition in Cyber‐Physical‐Social Systems,” Multimedia Tools and Applications 83 (2023): 22909–22973.
B. Acharya, S. Panda, and N. K. Ray, “Multiprocessor Task Scheduling Optimization for Cyber‐Physical System Using an Improved Salp Swarm Optimization Algorithm,” SN Computer Science 5, no. 1 (2024): 184.
V. Isaac–Chassande, A. Evans, Y. Durand, and F. Rousseau, “Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey,” ACM Transactions on Architecture and Code Optimization 21 (2024): 1–26.
L. Cheng, Y. Gu, Q. Liu, L. Yang, C. Liu, and Y. Wang, “Advancements in Accelerating Deep Neural Network Inference on AIoT Devices: A Survey,” IEEE Transactions on Sustainable Computing (2024): 1–18.
M. Yang, C. Li, Y. Tang, W. Wu, and X. Zhang, “A Collaborative Resequencing Approach Enabled by Multi‐Core PREA for a Multi‐Stage Automotive Flow Shop,” Expert Systems with Applications 237 (2024): [eLocator: 121825].
Z. Amiri, A. Heidari, N. J. Navimipour, and M. Unal, “Resilient and Dependability Management in Distributed Environments: A Systematic and Comprehensive Literature Review,” Cluster Computing 26, no. 2 (2023): 1565–1600.
R. R. Expósito and J. González‐Domínguez, “BigDEC: A Multi‐Algorithm Big Data Tool Based on the K‐Mer Spectrum Method for Scalable Short‐Read Error Correction,” Future Generation Computer Systems 154 (2024): 314–329.
B. Mandal, D. Bostan, C. Joy, and D. Babikov, “MQCT 2024: A Program for Calculations of Inelastic Scattering of Two Molecules (New Version Announcement),” Computer Physics Communications 294 (2024): [eLocator: 108938].
J. Liu, X. Yang, Z. Zhang, and M. Liu, “A Massive MPI Parallel Framework of Smoothed Particle Hydrodynamics With Optimized Memory Management for Extreme Mechanics Problems,” Computer Physics Communications 295 (2024): [eLocator: 108970].
P. Benedusi, S. Riva, P. Zulian, J. Štěpán, L. Belluzzi, and R. Krause, “Scalable Matrix‐Free Solver for 3D Transfer of Polarized Radiation in Stellar Atmospheres,” Journal of Computational Physics 479 (2023): [eLocator: 112013].
J. López‐Miralles, J. M. Martí, and M. Perucho, “On the Application of Jacobian‐Free Riemann Solvers for Relativistic Radiation Magnetohydrodynamics Under M1 Closure,” Computer Physics Communications 284 (2023): [eLocator: 108630].
H. Mikram, S. El Kafhali, and Y. Saadi, “HEPGA: A New Effective Hybrid Algorithm for Scientific Workflow Scheduling in Cloud Computing Environment,” Simulation Modelling Practice and Theory 130 (2024): [eLocator: 102864].
© 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.