Dataflow computing has proved to be more efficient than control-flow computing for certain high-performance computing algorithms. The prerequisite is that there is enough parallel computation to cover the overhead of feeding the dataflow hardware until the first result is ready. This is often true for algorithms that work with big data and can process many iterations independently, e.g., when simulating certain phenomena in many elementary volumes. However, dataflow hardware typically runs at frequencies an order of magnitude lower than those of control-flow processors. From a programmer's point of view, programming dataflow architectures is also considerably harder than programming control-flow architectures. As a result, it is not always obvious whether programming a dataflow architecture for a given algorithm is worth the effort. Therefore, a programmer needs to be able to predict the outcome of programming for dataflow architectures in terms of accelerated program execution and power savings. This article presents a newly developed tool that a programmer can use to profile control-flow algorithms and estimate the acceleration achievable using dataflow hardware.
Introduction
For more than a decade, we have witnessed a shift from improving high-performance computing (HPC) architectures by increasing frequencies towards increasing the number of instructions that can be executed in parallel. Both multicore and so-called manycore processors gain more cores each year. Besides the growing number of cores in control-flow processors, the dataflow programming model has gained importance as an alternative to the control-flow programming model. Dataflow computing [1, 2] offers more parallelism and is therefore more efficient for certain algorithms that involve repetitive execution of the same set of instructions. Thanks to its lower frequencies, it is also more energy efficient than its control-flow counterparts. However, using a dataflow architecture [3, 4] comes with additional costs. Programming dataflow architectures is considerably harder than programming control-flow architectures for many reasons [5]. The logic differs from the one programmers have predominantly been taught. Compilers and supporting tools are less developed, and there are fewer integrated development environments (IDEs). For all these reasons, there are far fewer dataflow than control-flow programmers. This paper presents an approach for predicting the acceleration achievable when an algorithm is transformed from the control-flow programming model into the dataflow one [6]. Before the problem definition, a brief overview of both control-flow and dataflow architectures is presented, along with the programming languages and frameworks used to program them. A hypothetical processor that incorporates both control-flow and dataflow computing models on the same chip is also presented.
Control-flow architectures
Control-flow architectures are based on the principles defined by von Neumann [7]. The central processing unit (CPU) executes instructions in sequential order using the arithmetic logic unit (ALU). Executing each instruction requires performing a set of tasks. At the beginning of an instruction execution, the instruction has to be fetched from memory. The instruction's memory location is determined by the program counter (PC). After fetching the instruction, this counter has to be increased by the size of the instruction, so that it points to the next instruction in main memory. Once the instruction is fetched, the decoding process determines the type of the instruction to be executed and which operands have to be fetched from memory. Based on the location of the operands, their addresses have to be evaluated, and the operands have to be fetched into registers. Once the input data is available, the execution phase in the ALU can be performed, followed by saving the result, which requires determining the location where the result should be stored, calculating the memory address if needed, and copying the result to the destination. These principles have remained in place for decades and are still in use.
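As an illustration of this cycle, the following minimal interpreter sketch goes through the same fetch, decode, execute, and write-back steps in software; the two-operand instruction format and opcodes are invented for this example and do not correspond to any real instruction set.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Invented two-operand instruction set, for illustration only.
    enum Opcode : std::uint8_t { ADD, SUB, HALT };
    struct Instruction { Opcode op; int dst; int src; };

    void run(const std::vector<Instruction>& program, std::vector<std::int64_t>& regs) {
        std::size_t pc = 0;                        // program counter
        for (;;) {
            const Instruction instr = program[pc]; // fetch the instruction at PC
            ++pc;                                  // advance PC to the next instruction
            switch (instr.op) {                    // decode
                case ADD: regs[instr.dst] += regs[instr.src]; break; // execute in the ALU
                case SUB: regs[instr.dst] -= regs[instr.src]; break; // and write back
                case HALT: return;
            }
        }
    }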
With the growing demand for HPC applications, computer architectures based on control-flow have been continuously improved. Since the 1980s, these architectures were predominantly improved by raising frequencies. Currently, their frequencies typically range from around 3 GHz to around 5.2 GHz. Further rapid increases in frequency lead to unreasonably high costs in terms of electrical power and cooling. As a result, we have witnessed a rising number of cores in recent decades. This also applies to so-called manycore architectures with thousands of cores, such as those found in graphics cards.
In parallel with increasing the performance of processors, the architecture of complex computer systems that include multiple computers or nodes has further evolved. These architectures exist in both computer clusters and clouds. Cluster computing is predominantly used in computer centers that consist of many nodes connected through a local area network so that they can act as a single machine working on a single algorithm, or many algorithms simultaneously. Cluster nodes often use the same type of hardware. Nodes typically include several CPUs and are in some cases accompanied by auxiliary hardware accelerators, such as computer graphics cards, or dataflow hardware. Nodes are scheduled to perform specific tasks for the cluster to produce the output of a program. Splitting the execution on multiple nodes (if possible) leads to solving complex operations faster than a single CPU would.
In contrast to cluster computing, cloud providers offer infrastructure to the public or to certain parties. A cloud can incorporate heterogeneous computing nodes. From a user's perspective, there are as many computing resources as the user has paid for. Compared to executing an algorithm on an owned computer, cloud computing is more expensive, but if a user wants to execute identical copies of a program in parallel on many computing nodes once per month, it is usually more reasonable to pay only for what is needed.
One of the key points of control-flow architectures is that the architecture allows instructions to be executed in an arbitrary order. A compiler defines which instructions should run and in which order. Any instruction defined by the architecture is supported.
The downside of the control-flow computing model is that a CPU typically contains around one billion transistors. These are needed to support the full instruction set, cache memories, etc. However, only up to a few instructions can run simultaneously. Since only a few transistors are enough to perform a single-bit addition or subtraction, around a thousand transistors suffice for adding two 64-bit numbers. Roughly speaking, the transistor budget of a control-flow architecture could support a million times more instructions per second than it actually executes. Of course, exploiting this would require delivering far more input data per second than currently available bus speeds allow. This problem can only be partially solved by introducing various techniques [8, 9]. Caches also impose limitations on parallel processing [10].
Control-flow languages
Control-flow architectures have been predominantly in use for more than half a century. As a result, many programming languages have been introduced. They can be divided into procedural and non-procedural languages; we will focus on optimizing algorithms written in procedural languages. The categories discussed below include:
- Machine languages,
- Procedural languages (C, Java, Python, etc.),
- Object-oriented languages,
- Scripting languages, and
- Dataflow languages.
The first computers were programmed directly in machine languages. Such programs were relatively hard to write, but also to understand, and machine languages are hardly used by programmers nowadays. As the capabilities of computers improved, so did the languages. Soon after procedural languages appeared, so did functional and object-oriented languages. Recently, scripting languages, although relatively slow, have been gaining popularity due to their ability to express concisely what needs to be executed.
The main distinguishing features of newer computer languages are their ability to express programs using statements close to the human thinking process and their extensive reliance on library functionality (templates, functions, etc.).
There are software and hardware dataflow models. Hardware dataflow assumes hardware that performs operations in parallel. Software dataflow, on the other hand, enables program development for control-flow architectures by defining dataflow graphs of execution. Software dataflow languages are often high-level programming languages that define the behavior of processes that should run in parallel. They are meant for distributed computing, often using a special-purpose computer architecture, e.g., pipeline processors. The main goal of these languages is to define the input and output data and the functionality in terms of operations to be performed.
Programmers can also treat parallel computing platforms with programming libraries as programming languages. For example, CUDA was developed for efficiently programming computer graphics cards to execute algorithms that are not necessarily related to drawing on screens. Apache Hadoop represents another framework for programming computers to accelerate the processing of massive amounts of data with a relatively high amount of computation, exceeding the possibilities of desktop computers. Some libraries implement a parallel execution model and can usually be used with many programming languages; for example, MPI implementations such as Open MPI can be used with Fortran, C, C++, and Java. What frameworks for parallel programming, libraries, and parallel programming languages have in common is that they provide synchronization constructs between processes or threads of parallel execution.
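As an illustration of such synchronization constructs, here is a minimal MPI sketch in C++ (assuming an installed MPI implementation): each process computes a stand-in partial result, all processes synchronize at a barrier, and rank 0 combines the results.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);       // identifier of this process
        MPI_Comm_size(MPI_COMM_WORLD, &size);       // total number of processes

        double partial = static_cast<double>(rank); // stand-in for real per-process work
        MPI_Barrier(MPI_COMM_WORLD);                // synchronization construct: wait for all

        double total = 0.0;                         // combine partial results on rank 0
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) std::printf("combined result: %f\n", total);
        MPI_Finalize();
        return 0;
    }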
Dataflow architectures
Dataflow architectures have been studied almost as long as control-flow architectures. Early work was often based on systolic arrays, frequently organized to achieve the best performance for certain algorithms [11]. In a dataflow architecture, program execution is performed by moving data from one processing element (PE) to another. Later, dataflow architecture development benefited from Field-Programmable Gate Arrays (FPGAs), which enabled reconfiguring the hardware [12].
Unlike control-flow architectures, which assume a computer program that instructs a processor which instructions to execute and in which order, in dataflow computer architectures the hardware is configured, for each program, so that the data flows through it, as shown in Fig. 1. This brings multiple benefits that shorten the total execution time.
[See PDF for image]
Fig. 1
The dataflow computing model
Dataflow architectures do not suffer from bus congestion, as the data flows from each source processing element to the destination processing element or elements in parallel.
Processing elements forward their results to the ones that need them. Therefore, there is no need to put results into the memory and fetch them later for other processing elements to process them.
Dataflow architectures do not need complex control-flow logic. As a result, transistors that would otherwise be occupied with this logic could be utilized for processing elements.
Cache memories usually occupy around half of the total number of transistors in control-flow processors. In the case of dataflow architectures, there is no need for a complex caching mechanism.
The downside of the dataflow computing model is that programming such architectures requires awareness of limited hardware resources. While control-flow architectures can execute any instruction sequence defined by a computer program, dataflow hardware must be configured to execute only a certain set of instructions. Another disadvantage is that a CPU is still needed to prepare the data for processing on the dataflow hardware. Besides hardware constraints, programming dataflow architectures requires more programming effort.
Dataflow languages
The most famous computer languages used for configuring FPGAs are VHSIC Hardware Description Language (VHDL) and Verilog. However, their usage requires a user to understand the hardware components and configure the hardware gradually.
The opposite of configuring the hardware manually is to use a framework that configures the hardware automatically based on the dataflow kernel (the portion of the algorithm that will run on the dataflow hardware). The kernel can be programmed using a custom-defined language and the Maxeler framework [13]. Researchers have developed various programming languages that help program dataflow kernels [14]. The profiling example that will be shown in this article is based on the Lattice-Boltzmann method implemented for dataflow hardware [15]. The implementation of the Lattice-Boltzmann method includes dataflow kernels implemented in the MaxJ language [16], which define the behavior of the dataflow hardware, and a managing class (often called a manager) that handles streams between the CPU host and the dataflow hardware. Many control-flow algorithms have been translated into the dataflow representation using this approach [17–21].
Besides dataflow engines and frameworks, an IDE is also available for programming dataflow hardware. Figures 2 and 3 present a dataflow programming IDE in which a user can program dataflow hardware directly on the server. The IDE allows users to write CPU code, dataflow kernels, and managers. Figure 2 shows the control-flow programming environment, while Fig. 3 shows the dataflow programming environment.
[See PDF for image]
Fig. 2
Dataflow programming IDE: control-flow
[See PDF for image]
Fig. 3
Dataflow programming IDE: dataflow
Many optimization techniques can be applied to algorithms developed for standard CPU and graphics processing unit (GPU) computing systems. Dataflow programming adds a layer of complexity. Developers can exploit parallelism through loop unrolling; they should utilize spatial computation, reorganize the code to avoid stalls in the execution, store data in the dataflow hardware for faster access, and optionally reduce the precision of real numbers so that the dataflow hardware runs faster.
Hybrid architectures
Besides control-flow and dataflow architectures, researchers have been developing hybrid processors consisting of both control-flow and dataflow hardware [22, 23]. Hybrid processors can optionally incorporate Internet-of-Things components on the same chip die as well, as is the case with the Ultimate dataflow architecture [24, 25]. The control-flow hardware includes both a typical multicore CPU and so-called manycore architectures, like those found in graphics cards (GPUs). This can be considered an implantation, where an existing approach is utilized in a new field, bringing higher performance [26]. Figure 4 presents an example of a hybrid control-flow and dataflow processor, along with GPUs. NoC stands for network on chip, and Ethernet represents the Ethernet controller.
[See PDF for image]
Fig. 4
Structure of the Hybrid processor with control-flow and dataflow components
The benefit of the proposed architecture is that both the control-flow and the dataflow hardware are on the same chip so that the communication time between them is shorter than it would be otherwise.
The disadvantage of these hybrid architectures is that they consume considerably more space and have a higher number of transistors, yet when there is no job for the dataflow hardware, it stays underutilized. Therefore, they are suitable solely for high-performance computing, where the expected occupation of each of the available resources can be predicted relatively accurately. Other drawbacks of the hybrid processor are that the programming effort is considerably higher compared to programming for control-flow processors, and that the probability of failure of such a processor is somewhat higher, although the expected lifespan might be almost the same [27, 28].
Profiling program execution for determining acceleration possibilities
The brief analysis of the architectures presented in this section leads to the conclusion that only a subset of high-performance computing algorithms is suitable for computer architectures that include dataflow hardware. Most high-performance computing algorithms are already available for control-flow architectures. Profiling their execution can provide useful feedback on the acceleration achievable using dataflow hardware. Profiling involves extracting the necessary data during the execution of the algorithm. In the case of estimating dataflow acceleration potential, these data include the number of instructions that would need to be implemented on the dataflow hardware and the frequencies with which these instructions are executed. Both depend on the type of high-performance algorithm and on the input data. Before a programmer starts modeling the dataflow hardware to execute an algorithm, it would be beneficial to estimate the acceleration factor. A programmer can then devise various user scenarios (i.e., program inputs) and test the execution of an algorithm on control-flow hardware before investing effort into implementing it on dataflow hardware. The programmer would be interested in identifying those instructions that would be justified to run on the dataflow hardware. This set of instructions depends on the input data as well. Therefore, profiling can give useful feedback on which portions of the algorithm are worth implementing on the dataflow hardware, depending on the input data.
The rest of the paper is organized as follows. Sect. “Problem Statement: Estimating Dataflow Acceleration Potentials” defines the problem of estimating dataflow acceleration potentials. Sect. “Existing Research” presents state-of-the-art solutions in this domain. Sect. “Essence of the Proposed Solution and its Potentials” describes the proposed solution of profiling control-flow program execution to identify machine instructions and estimate acceleration possibilities. Sect. “Experimental Analysis of the Proposed Solution” presents an experimental analysis of the proposed solution using the Lattice-Boltzmann method as an example. This section is followed by the validity of results and conclusion sections.
Problem statement: estimating dataflow acceleration potentials
Programming dataflow architectures is known to require more effort than programming control-flow architectures [27]. As a result, programmers tend to reuse control-flow algorithm implementations when programming dataflow hardware, applying available tools to transform programs automatically. Often, it is not clear whether transforming a program from the control-flow to the dataflow paradigm is justified. In some cases, the resulting software runs slower than the control-flow version. Sometimes dataflow hardware does not accelerate program execution as much as a programmer would expect. Therefore, it is important to be able to predict the acceleration before programming the dataflow hardware.
Even if the algorithm can be accelerated using dataflow hardware, it is hard to estimate the minimum size of the data to be processed for which programming the dataflow hardware is justified.
Various techniques have been tried. Researchers have been working on automatically or semi-automatically translating control-flow software into dataflow software for decades.
Existing research
Algorithm profiling enables tracking the execution of an algorithm and can be utilized for many purposes. This section explains some of the representative research in the field and discusses its potential for estimating the acceleration of control-flow software using dataflow hardware.
Researchers have tried to profile and trace programs by inserting profiling source code into existing source code in an optimal way, so that the profiling does not affect the total execution of the program more than necessary [29]. The goal of that research was to profile the important aspects of the software. However, it cannot be applied to efficiently and precisely estimate dataflow potential without expert involvement, because arithmetic statements generally translate into many machine-level arithmetic instructions, and the potential dataflow acceleration of evaluating a statement is almost directly proportional to the number and type of these arithmetic instructions.
A semantic space profiler for parallel functional programs can be utilized so that programmers can understand the impact of different scheduling policies while hiding scheduling implementation details [30]. The profiling is based on cost semantics and is applicable where software needs to be parallelized on control-flow processors, but it cannot be efficiently used for determining dataflow acceleration potential.
Path profiling is an efficient way to determine how many times each acyclic path in a routine executes [31]. This applies to estimating the portions of the algorithm that should be translated onto the dataflow hardware. However, the granularity of such parallelization is at the kernel level, and it cannot be utilized at the instruction level. The same applies to the graph execution profiler [32].
Researchers have utilized a profiler to collect statistical information from unmodified MapReduce programs so that fine-grained cost estimation can be performed. It is reasonably detailed but aimed at MapReduce systems [33], not at dataflow hardware acceleration, where the acceleration possibilities of single machine instructions are of interest.
Researchers have also been working on profiling for performance debugging [34]. Their profiler pinpoints performance bottlenecks and is also able to find optimization opportunities. As such, it is partially applicable to the problem addressed in this article. However, it does not profile machine-level instruction execution.
Other profiling techniques focus on dataflow graph analysis for improving performance by measuring path frequencies or tracking certain aspects of algorithm execution. However, to the best of the author's knowledge, there is no solution that profiles control-flow algorithm execution in order to estimate how much dataflow hardware can accelerate it.
Essence of the proposed solution and its potentials
It is possible to predict the acceleration potential before configuring the dataflow hardware by simulating the execution of the software on the dataflow hardware. This can be achieved by logging the execution flow of the control-flow algorithm and estimating, based on the known execution speed of the dataflow hardware, how much the dataflow hardware could accelerate the software.
Each arithmetic operation can be logged. Based on the total number of arithmetic operations and dependencies between operands, one could estimate the level of parallelism on the dataflow hardware. Further, the timing constraints can be predicted. Therefore, the clock speed of the dataflow hardware can be estimated, leading to the estimated software execution time using the dataflow hardware.
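To make the estimate concrete, a first-order model (the notation here is ours, not the profiler's) for a fully pipelined kernel that accepts one data item per clock cycle is:

$$T_{df} \approx \frac{n + d}{f_{df}}, \qquad S_{est} = \frac{T_{cf}}{T_{df}},$$

where $n$ is the number of data items, $d$ the pipeline depth, $f_{df}$ the estimated dataflow clock frequency, and $T_{cf}$ the measured control-flow execution time.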
The process of predicting the acceleration of control-flow software will be briefly explained, and then the C++ profiler will be presented gradually, starting from its most simplified version. One can consider this process as the development phase of the project. This way, the reader can follow the development process and reproduce it. The C++ source code will be shown for the most important aspects of estimating the acceleration potential.
Profiling control-flow program
The process of profiling consists of the following steps:
- Wrapping each operator call.
- Storing statistical data within each operator call.
- Storing logs, together with operation codes, for further reference.
- Processing statistical data and logs for estimating dataflow execution time.
- Comparing the measured or estimated control-flow execution time to the estimated dataflow execution time.
Wrapping operators can be performed in the C++ language using a template class that represents all primitive types and overloads the arithmetic operators. Within each operator call, the necessary bookkeeping can be performed.
Storing statistical data is needed for the profiling process that follows the execution of the algorithm on a CPU. The number of calls of each different operator has to be recorded. The counts can be checked by altering the number of iterations of the for loop: if the loop body is executed only once, the recorded totals equal the number of calls of each operator in a single iteration.
During the profiling process, all timestamps are stored in logs for further reference. Each operator call is also recorded. Based on the log, one can perform analytical profiling to calculate how much each portion of the control-flow source code contributes to the total execution time.
Estimating the dataflow execution time can be achieved by processing the statistical data and logs. Since no dataflow-related parameters are logged, they have to be provided based on the execution of dataflow programs on the dataflow hardware. In this work, they are determined empirically from the execution of the Lattice-Boltzmann method.
The comparison of the control-flow execution time to the estimated dataflow execution time is performed by measuring the time between the first and the last executed instruction on the control-flow processor. However, in some cases the algorithm might include parts that are not eligible for the dataflow hardware, and these may take a considerable portion of the total execution time. For a better comparison, the control-flow execution time of the portion of the algorithm that can run on the dataflow hardware can be estimated by taking into account the total number of instructions that have to be executed and the speed of the CPU. This would require knowing how the control-flow algorithm is divided among CPU cores. Therefore, a fair comparison sums up the duration of each iteration. For connecting timestamps to operations, a statistics class can be used. A user can call routines to define the unique identifier of any operator call. The first instruction of an iteration can be determined as the first logged instruction. The last one is the one that precedes the following execution of the first instruction. The presence of a conditional statement might alter the last instruction, leading to several variants of the last instruction of an iteration. In any case, the duration of one iteration is calculated as the difference between the timestamp of the end of its last instruction and the timestamp of the beginning of its first instruction. The duration of the time measurement itself has to be subtracted from the total CPU execution time of the portion of the algorithm that can be accelerated using the dataflow hardware.
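As a sketch of this comparison step, assuming the log has already been parsed into per-iteration timestamp pairs (the structure and function names below are ours, not the profiler's):

    #include <vector>

    // One profiled iteration: timestamps (in seconds) of its first and last
    // logged instruction, recovered from the log described above.
    struct IterationSpan { double first_ts; double last_ts; };

    // Sum the per-iteration durations, so that time spent in parts of the
    // algorithm that are not eligible for the dataflow hardware is excluded.
    double dataflow_eligible_cpu_time(const std::vector<IterationSpan>& iterations) {
        double total = 0.0;
        for (const IterationSpan& it : iterations)
            total += it.last_ts - it.first_ts;  // duration of one iteration
        return total;
    }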
C++ profiler
Profiling a control-flow program written in the C++ language can be achieved by logging arithmetic instructions as if the machine code were being analyzed. Operator overloading in C++ offers the possibility of redefining the functionality of operators. The focus of this article is on binary operators, but the same logic applies to unary operators as well. Redefining operators for primitive types is not allowed: the C++ language requires that operator overloads take at least one operand of a non-primitive type. This could cause difficulties in determining which operator will be called. Also, while working with operations, it is beneficial to store the log and count how many times each operation was performed. As a result, we have implemented wrapper classes for primitive types.
Storing statistical information is achieved by implementing the statistics class. In the example, only a few fields and methods are shown. These include a handle to the output stream where the log resides, a timestamp, and the operation code of the arithmetic operation. A programmer can mark the beginning of the execution of the portion of the algorithm that is suitable for the dataflow hardware by calling the start_simulation() method, and its end by calling the end_simulation() method. The get_id() method is used to obtain the ID of the instruction that will be executed next. This is used for defining dependencies between instructions where they cannot be determined automatically. The Statistics class is depicted in Fig. 5.
[See PDF for image]
Fig. 5
Class Statistics for storing profiling data
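Since the figure is not reproduced here, a minimal sketch of such a class follows, consistent with the description above; the exact fields and signatures in Fig. 5 may differ, and the set_next_ID() method is described later in the text.

    #include <chrono>
    #include <fstream>

    // Minimal sketch of the Statistics class (cf. Fig. 5); field names are ours.
    class Statistics {
        std::ofstream log;        // handle to the output stream where the log resides
        long long cond_id = 0;    // unique ID of the last conditional statement
        long long local_idx = 0;  // order of the instruction after that conditional
    public:
        explicit Statistics(const char* path) : log(path) {}
        void start_simulation() { log << "start\n"; } // begin of dataflow-eligible part
        void end_simulation()   { log << "end\n"; }   // end of dataflow-eligible part
        long long get_id() const { return cond_id; }  // ID of the upcoming instruction
        void set_next_ID(long long id) { cond_id = id; local_idx = 0; } // reset counting
        void record(int opcode) {                     // log one operator call
            auto ts = std::chrono::steady_clock::now().time_since_epoch().count();
            log << ts << ' ' << opcode << ' ' << cond_id << ':' << local_idx++ << '\n';
        }
    };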
Let us consider an operator that works with two integer numbers. The class could be defined as in Fig. 6. Similar functionality can be implemented for other primitive types, e.g., unsigned int, size_t, float, double, bool, etc.
[See PDF for image]
Fig. 6
Class DF_int for profiling integer arithmetic instructions
However, this would unreasonably increase the size of the profiler source code and would also make the profiler error-prone, as the same or very similar source code would be repeated for all of the primitive types. Instead, we can define the template class DF_var for storing any of the primitive types. In Fig. 7, the most important parts of the function definitions for manipulating statistics are also shown.
[See PDF for image]
Fig. 7
Template class DF_var for profiling all primitive types arithmetic instructions
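As Fig. 7 is not reproduced here, the following is a minimal sketch of such a template class; the real DF_var logs considerably more bookkeeping data than this stripped-down version.

    // Minimal sketch of the DF_var template described above (cf. Fig. 7).
    template <typename T>
    class DF_var {
        T value_;                           // the wrapped primitive value
    public:
        DF_var(T v = T()) : value_(v) {}    // implicit conversion eases instrumentation
        T value() const { return value_; }  // read access to the wrapped value
        // Arithmetic operators are overloaded as free functions so that every
        // call can be logged; see the operator+ sketch following Fig. 8 below.
    };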
The operator overload is depicted in Fig. 8. Instead of simply summing two numbers, additional statistical data can be collected as well, along with auxiliary functionality. The source code depicts the plus operator, but the logic is analogous for the other binary operators.
[See PDF for image]
Fig. 8
Overloading an operator for profiling
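A minimal sketch of such an overload, matching the DF_var and Statistics sketches above, follows; the global stats object and the OP_ADD code are assumptions of this sketch, not necessarily the layout used in Fig. 8.

    static Statistics stats("profile.log"); // single profiler-wide statistics instance
    enum { OP_ADD = 0 };                    // operation code used by this sketch

    // The overload performs the addition and records the operator call.
    template <typename T>
    DF_var<T> operator+(const DF_var<T>& a, const DF_var<T>& b) {
        stats.record(OP_ADD);               // timestamp + opcode stored in the log
        return DF_var<T>(a.value() + b.value());
    }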
The Statistics class enables counting the number of operations and summing the estimated durations of instructions, as well as estimating hardware requirements based on cost functions.
The estimated speedup of the portion of the algorithm that is marked for execution on the dataflow hardware can be calculated as the ratio of the time the CPU needed to execute that code to the estimated time the dataflow hardware needs to execute it. The estimated speedup of the whole software must take into account the portion of the software that is not suitable for the dataflow hardware; therefore, this estimated speedup is always smaller than the previously defined one.
Identifying machine instructions
One of the most challenging problems in profiling is connecting the operator calls to the C++ statements that caused them. This can be achieved by assigning each conditional statement a unique identifier and counting the instructions that follow it, as depicted in Fig. 9.
[See PDF for image]
Fig. 9
Example application profiling
The logging of conditional statements can be achieved by calling the set_next_ID() method. The calls to the method can be placed automatically. One way to automate this would be to create a parser for C++ code that recognizes conditional statements and automatically adds a call with the next available unique identifier right after each of them. Another option is to perform this at the compiler level. The profiling can be performed on all conventional operating systems; in this work, Linux was used. The calls to the set_next_ID() method are inserted automatically by parsing the source code. A simple Linux sed command can add the calls without unique identifiers, and the vim editor supports creating macros with a couple of keystrokes (e.g., q for starting a macro recording and a as a macro name), where a previously used unique identifier can be copied and incremented using the Ctrl+A shortcut. Both sed and vim are available on Windows and macOS as well.
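For illustration, a hypothetical instrumented fragment could look as follows, building on the sketches above; the IDs 1 and 2 stand for automatically assigned unique identifiers.

    #include <cstddef>
    #include <vector>

    // Hypothetical fragment with set_next_ID() calls inserted after each
    // conditional statement (here, the loop condition and the if statement).
    void process_profiled(std::vector<DF_var<double>>& w,
                          const std::vector<DF_var<double>>& u,
                          const std::vector<DF_var<double>>& v) {
        for (std::size_t i = 0; i < u.size(); ++i) {
            stats.set_next_ID(1);            // conditional site 1: the loop condition
            DF_var<double> s = u[i] + v[i];  // logged as instruction (1, 0)
            if (s.value() > 0.0) {
                stats.set_next_ID(2);        // conditional site 2: the if statement
                w[i] = s + s;                // logged as instruction (2, 0)
            }
        }
    }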
Instructions that follow a conditional statement appear as successive both in the source code and in the execution phase. Each non-conditional instruction can therefore be identified by the unique identifier of the preceding conditional statement and by its order within the row of instructions that follow it. For this to function properly, the counting has to be reset once an iteration finishes and a new one starts; therefore, counter-resetting functionality has to be provided. Figure 10 depicts the profiler methods for assigning global and local identifiers to an instruction. Besides the identification process, each operation is also stored in a vector of vectors of operations.
[See PDF for image]
Fig. 10
Assigning unique identifiers to arithmetic instructions
Experimental analysis of the proposed solution
The experimental analysis of the profiling is performed on the control-flow and dataflow implementations of the Lattice-Boltzmann method [35]. The estimated speedup of the Lattice-Boltzmann algorithm using the dataflow hardware can be calculated as the ratio of the time needed for the control-flow hardware to execute the algorithm to the sum of the estimated time the dataflow hardware needs to execute it and the time the CPU needs to prepare the data for the dataflow hardware.
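In the notation introduced earlier (ours, for clarity), the estimate reads:

$$S_{est} = \frac{T_{cf}}{T_{df} + T_{prep}},$$

where $T_{cf}$ is the measured control-flow execution time, $T_{df}$ the estimated dataflow execution time, and $T_{prep}$ the CPU time needed to prepare the data.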
The results of this study will be presented chronologically. First, the Lattice-Boltzmann implementation for the control-flow processor will be briefly presented, followed by an analysis of the potential for accelerating the Lattice-Boltzmann method using dataflow hardware. Next, the dataflow implementation of the Lattice-Boltzmann method will be presented. Finally, the results of the analysis will be given.
The Lattice-Boltzmann method
Figure 11 presents a compact form of the main loop of the Lattice-Boltzmann method, implemented in the C programming language for a control-flow-based computer architecture. Certain functions are called over and over again, maxIter times in total, where maxIter is the number of iterations required for the simulation.
[See PDF for image]
Fig. 11
Critical code of the Lattice-Boltzmann method
The stream function requires all matrices to be fetched before the execution starts. However, it does not account for a considerable amount of the total data processing. The apply_BCs function is only responsible for updating edge elements. The collide function also requires all matrices to be fetched, but unlike the stream function, it performs a significant amount of processing for each matrix coordinate. The functions stream and collide are part of the source code of the Lattice-Boltzmann implementation available on the Internet [35].
The source code of the collide() function of the Lattice-Boltzmann method is given in Appendix 1. The profiler can output various types of information about these functions, but the detailed outputs are too large to fit within this article. The portion of the code from Appendix 1 produces 124 different operator calls.
Analysis of potentials for accelerating the Lattice-Boltzmann method
To begin with, an analysis of the potential for speeding up the Lattice-Boltzmann method using dataflow principles will be given. Because technologies change, for both control-flow processors and dataflow hardware, the dataflow hardware frequency is assumed to be exactly 10 times lower than that of the control-flow processor. The transfer speed between the processor and the dataflow hardware is assumed to be equal to the speed of PCIe v4.0, which is 31,508 MB/s. It should be noted that this is also an assumption, and this speed will eventually be considered obsolete, but the analysis itself will not change until the two underlying hardware implementations change.
Amdahl's law states that the upper bound on the speedup of any application is given by 1 / (1 − p), where p represents the portion of the application that can be executed in parallel. In our case, this means that only the instructions executing on the dataflow hardware can be accelerated. Fortunately, almost all of the execution time of the Lattice-Boltzmann algorithm is spent in the main loop, resulting in a high speedup potential.
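Written out, with $s$ denoting the speedup of the accelerated portion, the bound and its refined form are:

$$S_{max} = \frac{1}{1-p}, \qquad S = \frac{1}{(1-p) + p/s}.$$

For example, if $p = 0.99$, no amount of dataflow acceleration can push the overall speedup beyond $S_{max} = 100$.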
The core execution complexity on the dataflow hardware is O(1), because one input is sent to the dataflow hardware at each clock cycle. A control-flow processor, on the other hand, executes the algorithm in as many cycles as it takes to execute all the instructions that produce the same result as the dataflow hardware; let us call this number n. Note that the instructions are not the same in the implementations for a control-flow processor and for dataflow hardware. The complexity of the corresponding code on a control-flow processor is O(n), where n is the number of instructions. The advantages of the dataflow approach are already apparent. An estimate of the maximum achievable acceleration follows. By analyzing the control-flow code, the number of processor cycles needed to calculate the result using a single core is estimated at 190. This estimate is based on the assumption that the number of machine instructions is 140, that add, sub, mul, and div each take one clock cycle, and that the sqrt instruction, which calculates the square root of a number, takes 50 clock cycles on average. Pipelines and parallel cores are not assumed in the case of the dataflow hardware. The maximum speedup that can be achieved under these assumptions is 190 divided by 10, i.e., a factor of 19, since the dataflow hardware has an exactly 10 times lower clock frequency than the control-flow processor. However, the actual speedup of the Lattice-Boltzmann implementation for dataflow hardware is expected to be lower. It should be noted that the Lattice-Boltzmann algorithm is scalable, control-flow processors support multithreading, and dataflow hardware can fit more than one execution kernel, which makes parallelization possible on both types of processors.
Since power consumption is almost proportional to the square of the frequency, the consumption of dataflow hardware has the potential to be 100 times lower under the same circumstances. This applies only to the power consumed by the dataflow hardware itself. A computer with a dataflow card, however, consumes almost twice as much electricity, which is to be expected, since both the processor and the dataflow card have to be powered. A total reduction in electrical energy can still be achieved with the dataflow paradigm, as the dataflow hardware can execute multiple instructions simultaneously; the overall execution time can thus be reduced, and the power consumption with it. In practice, clusters of computers contain dataflow cards, which greatly reduces the electricity consumed by processors relative to the electricity required by the dataflow cards.
By comparing the transfer speed to the dataflow hardware with the frequencies of today's processors, we can see that about 2.5 bytes can be transferred per processor clock cycle, or 25 bytes per cycle of the dataflow card. This means that only about 6 floating-point numbers can be transferred to the dataflow card in each of its cycles, which would cause the dataflow card to stall. Fortunately, all the matrices can be transferred to the dataflow card before the calculation starts. Thus, the time required for the transfer is spent only once, and it is comparable to the execution time of only a few iterations of the Lattice-Boltzmann algorithm on the control-flow processor.
Implementing each instruction of the algorithm directly in hardware, instead of executing it on a control-flow processor, reduces the number of transistors that are needed, and the freed transistors can be used to speed up the algorithm. However, it cannot be argued that this necessarily leads to a reduction in power consumption, as a comparison between transistors in control-flow processors and those in dataflow hardware is not a fair one.
The Lattice-Boltzmann implementation for the dataflow hardware
This work is verified against the dataflow implementation of the Lattice-Boltzmann algorithm. To transform the control-flow version of the Lattice-Boltzmann algorithm into the dataflow implementation, an appropriate tool was needed to perform as much of the transformation automatically as possible. The following is a brief overview of the methods available in the field of systolic arrays.
The Cohen, Johnson, Weiser, and Davis method [36–39] maps mathematical expressions to systolic arrays. This is necessary for translating almost any application from a control-flow to a dataflow algorithm, and it is also included in most tools that can help with the translation process. The Lattice-Boltzmann algorithm includes a considerable number of arithmetic operations per each piece of data that needs to be moved to the dataflow hardware. The Lam and Mostow method (SYS) [40] also maps algorithms to systolic arrays, but it starts from an algorithm already suitable for translation to the dataflow paradigm. Since a tool needs appropriate markup to translate parts of an application to the dataflow hardware, most tools include libraries and commands that help programmers, where the programmer is responsible for indicating the parts to be translated to the dataflow algorithm, or include methods for defining appropriate representations of algorithms for dataflow architectures. The Gannon method [41] produces a functional specification based on an algorithm; in the case of the Lattice-Boltzmann method, a programmer must define the functionalities to be executed on the dataflow hardware. The H. T. Kung and Lin method [42] transforms a general mathematical expression into a pure algebraic expression, and then produces a specification of processing elements and defines timing constraints. In the case of the Lattice-Boltzmann method, such an approach cannot be applied without major modifications, because translating only the mathematical expressions to dataflow hardware would lead to excessive communication between the main memory and the dataflow hardware. The Kuhn method [43] and the Miranker and Winkler method [44] are most suitable for algorithms consisting of for loops with constant execution time and data dependencies between iterations. This is useful in the general case, but the Lattice-Boltzmann method requires more complex handling of corner cases; more precisely, it requires different kinds of processing for the elements belonging to the edges of the matrices it operates on. The Moldovan and Fortes method [45] produces an algebraic model of an algorithm from an application or a set of recurrent equations; however, this is not particularly useful in the process of translating the Lattice-Boltzmann method. The Lerner method optimizes the source code by propagating the results of instructions to the instructions that need them.
As a tool, the Maxeler framework was chosen due to the following characteristics, taking into account the previously described methods:
- Automatic transformation of mathematical expressions,
- Automatic transformation of for loops with fixed ranges, including support for loop unrolling,
- Automatic handling of communication between a control-flow processor and the dataflow hardware, and
- Automatic discarding of unreachable code.
In addition to these, the Maxeler framework offers the ability to write a program in a high-level programming language and supports invariant code motion, scalar variables, forward substitution, read-after-write elimination, and write-after-write elimination.
The Maxeler framework takes the input from the developer in the form of a kernel that is responsible for processing matrix elements on the dataflow hardware, where data is sent to the kernel as an input stream and another stream is returned as the result of the kernel processing. The kernel must have input and output parameters defined by the programmer as hardware variables in the MaxJ language. Everything in between is almost the same as in the control-flow version of the algorithm written in the Java language. All arithmetic operations are supported by the framework. The only limitation is that once a variable is saved as a hardware variable, it cannot be converted back to a Java variable. This is because wires are not perfect, and voltages can fall anywhere in the possible range, so the value cannot always be extracted accurately. We can look at it this way: a floating-point variable set to zero and then saved as a hardware variable is not ideal due to the physics of the conductive material; the voltage on that wire will not be exactly zero, but close to zero, because of the noise on the wire. If we tried to save the voltage back into a Java variable, since the voltage is not exactly zero, the value might differ from the initial one. Allowing applications to save hardware variables back into logical ones would introduce a new source of potential bugs. Programmers also cannot easily adapt to the inconsistencies caused by a variable changing its value when copied to a hardware variable and back to a logical one. Further, starting a deterministic application with the same input might not always give the same result. Therefore, the creators of the framework decided to disable this behavior. Data can be streamed back to the control-flow processor either from the main memory or from the dataflow hardware memory.
The specifics of the Maxeler framework will be described using the example of the Lattice-Boltzmann method. Most of the time spent executing the Lattice-Boltzmann algorithm on a processor is spent executing the collide function. Therefore, this function was implemented first as a kernel that processes all elements of the matrix, whereby the elements are sent as a stream from the processor to the dataflow card and the result is returned as a stream from the dataflow card to the main memory. In both the control-flow and the dataflow implementations of the Lattice-Boltzmann algorithm, matrices are stored as vectors, with the first row of the matrix stored at the very beginning of the vector, followed by the second row, and so on. Figure 12 shows the core code responsible for updating the matrix elements, with some lines responsible for working with vectors 1 to 9 omitted for clarity and replaced by dots.
[See PDF for image]
Fig. 12
Lattice-Boltzmann collide kernel
Since the Lattice-Boltzmann method has for loops with constant execution time, it is possible to achieve what is implied by the Kuhn method and the Miranker and Winkler method using the Maxeler framework and to implement the functionalities for the dataflow hardware. The Maxeler framework further performs a functional specification based on the dataflow kernel, as in the Gannon method. While the Cohen, Johnson, Weiser, and Davis method transforms mathematical expressions into the functionality of processing elements with a specification of timing constraints, the Lam and Mostow method describes more closely what is achieved using the Maxeler framework, because in the case of the Maxeler framework the kernel has to be written in the MaxJ programming language so that it can be mapped onto the FPGAs. Unfortunately, while the Moldovan and Fortes method is capable of automatically deriving an algebraic model of an algorithm from an application or a set of recurrent equations, in the case of the Maxeler framework it is necessary to think in terms of dataflow graphs and take special care with element indices. The Maxeler framework takes care of the functionalities defined in the H. T. Kung and Lin method. Using techniques available in compilers, signal propagation is optimized, as suggested by the Lerner method.
The stream kernel includes a significant amount of processing that must be performed on the matrices. Considering the advantages of the dataflow paradigm, including the execution of hundreds or thousands of instructions in parallel without the need to retrieve input data for each instruction over the internal bus, since adjacent processing elements exchange their inputs and outputs directly, it is to be expected that the kernel will execute the algorithm faster on the Maxeler card than on the control-flow processor. However, to speed up the execution of the whole algorithm, the remaining functions must also be executed on the Maxeler card.
The kernel corresponding to the stream function is given in Fig. 13. As in the case of the collide function, some redundant parts have been replaced with dots. The main reason for the long execution of the stream function is that all the elements must be loaded, while the processing that has to be done for each element is very simple. It should be noted that this kernel also includes the execution of instructions from the apply_BCs() function.
[See PDF for image]
Fig. 13
Implementation of the Lattice-Boltzmann stream function for a dataflow hardware
It has to be admitted that, in addition to the great functionality provided by the tools and methods, the optimization of algorithm execution on a dataflow architecture still depends on the developer. Figure 14 shows the main principle of implementing the Lattice-Boltzmann method on the selected dataflow architecture. Variables i and j represent the current column and row, respectively. To calculate the fluid parameters in an elementary small volume of space, the kernel must know the parameters of the surrounding elementary volumes. Therefore, the result of the kernel execution depends not only on the stream of elements of one column, containing the parameters of the corresponding elementary volumes, but also on the streams of the two adjacent columns. In addition, each element of the middle stream depends on its predecessor and successor, which can be obtained using the Maxeler framework's stream offsets (in our case −1 and +1, indicating the previous and next stream element, respectively).
[See PDF for image]
Fig. 14
Processing elements based on the dataflow paradigm
Figure 15 depicts the purpose of the Lattice-Boltzmann method manager. The manager is responsible for interconnecting kernels, but also for connecting processors with kernels. Although the Maxeler framework automatically creates an application manager, any customization requires the developer to modify the automatically generated manager. The algorithm for the Lattice-Boltzmann method iterates through all elementary volumes many times (maxIter times), and only the final result is of interest to the user. Therefore, all processing can be done on the dataflow hardware, since the dataflow hardware has enough memory to accommodate all the matrices needed to execute the algorithm. In this case, an important decision was to store all intermediate results in the memory on the Maxeler card, to save the time it would take to transfer this data back and forth between the processor and the dataflow hardware. This means that the data should be fetched from the main memory at the very beginning, and during the execution of the iterations themselves the kernel input should be taken from the dataflow hardware memory and streamed. Similarly, the results of the iterations should be stored in the memory of the dataflow hardware. Only at the very end of the algorithm execution is it necessary to transfer the data representing the results to the main memory. Therefore, in addition to defining the streams between the kernels and the processor, multiplexers had to be introduced to direct the data. Loading data from the processor through the Peripheral Component Interconnect Express (PCIe) bus to the dataflow card is performed before the dataflow hardware starts processing. Further, it is necessary to load the data from the DRAM memory on the dataflow card. Similarly, at the very end of the execution, it is necessary to return the result from the dataflow hardware to the main memory.
[See PDF for image]
Fig. 15
The Lattice-Boltzmann manager
The acceleration of the dataflow implementation of the Lattice-Boltzmann algorithm for various numbers of iterations is available in the open literature [35]. It is noticeable that the acceleration is higher when there are fewer loop iterations. This was expected, as the size of the matrices is about 320 × 112 elements, which is relatively small, enabling all of the matrices to fit within the cache. Therefore, with a lower number of iterations, the penalty of initially bringing the data into the cache is more noticeable. As can be seen from the published results, the acceleration of the dataflow implementation of the Lattice-Boltzmann method for a relatively high number of iterations is on average around 17.
Results of the analysis
The proposed dataflow acceleration estimation method can be applied to any control-flow application. In this work, the acceleration potential is estimated by running the profiler on the Lattice-Boltzmann method. The dataflow implementation of this algorithm can be found in the open literature [35].
Estimating the acceleration of the Lattice-Boltzmann method using the dataflow hardware will be performed in two ways. The first one is radical: the dataflow hardware is considered as a chip whose transistors are configured entirely for this single purpose. This implies that it is not possible to utilize existing FPGA cards; the chip die would have to be defined from the ground up. The second one takes into account conventional dataflow hardware that can be found on the market.
The purpose of the rough comparison is to show the theoretical potential of the dataflow hardware. The rough comparison is performed based on the following assumptions:
- 1 billion transistors for both the control-flow processor and the dataflow hardware.
- 64-bit operation takes 1000 transistors on average (around 15 per bit).
- CPU runs the application using solely one core without parallelization.
- Clock rate 1:10 in favor of the control-flow processor.
Based on these assumptions, the dataflow hardware is capable of one million instructions per clock cycle, while the control-flow hardware is capable of one instruction per clock cycle. With only these parameters in the acceleration estimation process, one could expect a dataflow hardware acceleration factor of one hundred thousand. This demonstrates the theoretical advantage of the dataflow hardware.
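The arithmetic behind this estimate is:

$$\frac{10^9\ \text{transistors}}{10^3\ \text{transistors per operation}} = 10^6\ \text{operations per cycle}, \qquad S = \frac{10^6}{1} \cdot \frac{1}{10} = 10^5.$$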
The theoretical speedup has to be adjusted to the properties of the algorithm. The dataflow implementation of the Lattice-Boltzmann method assumes two kernels that are similar in terms of processing, but the one that works with the edge elements is utilized less than one percent of the time, so the transistor count responsible for processing most of the matrix elements has to be cut in half. Also, the dataflow hardware has to provide the memory for storing the matrix elements. If the control-flow hardware has around one half of its transistors dedicated solely to cache memories, then dataflow hardware with half of its transistors dedicated to DRAM memory would have its speedup reduced by a factor of two. Therefore, the corrected theoretical speedup drops to 25,000. However, orchestrating the data requires resources, i.e., chip area. If we assume that half of the chip die surface is utilized for transferring intermediate results from one processing element to another, the theoretical speedup drops by a further factor of two.
If we assume that the cache speed of the control-flow hardware is around three terabytes per second at a frequency of three gigahertz, a thousand bytes can be transferred per clock cycle. For the dataflow hardware, which has a ten times lower frequency, this translates to ten thousand bytes per dataflow clock cycle. Each float variable takes four bytes, and three vector elements have to be streamed for processing each matrix element. This leads to a maximum of around eight hundred matrix element calculations per cycle. Since around ten matrices have to be fetched for a kernel to process them, we can expect that the acceleration factor is further reduced by a factor of ten.
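The arithmetic is as follows:

$$\frac{3\ \text{TB/s}}{0.3\ \text{GHz}} = 10^4\ \text{B per dataflow cycle}, \qquad \frac{10^4\ \text{B}}{3 \times 4\ \text{B per element}} \approx 833\ \text{elements per cycle}.$$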
Based on the profiling of the control-flow implementation of the Lattice-Boltzmann method, it can be concluded that 124 instructions are executed during each dataflow hardware clock cycle. When we compare the memory bandwidth, the amount of data provided to the kernel at each clock cycle, and the number of instructions executed in parallel within the kernel, it appears that the memory bandwidth would be the bottleneck, resulting in a total theoretical acceleration factor of around ten thousand.
The dataflow implementation of the Lattice-Boltzmann method was programmed for existing dataflow hardware. The setup used for measuring the acceleration consisted of a desktop computer with an Intel i5 650 control-flow processor and a MAX2 card with 6 GB of available RAM that can be accessed by multiple dataflow kernels in parallel. The control-flow hardware consisted solely of the Intel i5 650 processor with 4 GB of RAM running at 1333 MHz.
Besides data availability, for the dataflow hardware to be reconfigurable, the design has to be uniform. As a result, FPGAs consist of thousands of look-up tables (LUTs). The LUT is the basic building block capable of implementing any logic function of N Boolean variables; it is a table that determines the output based on the inputs. The hardware implementation of a LUT consists of a collection of memory cells connected to multiplexers, and a LUT can be used either as a function compute engine or as a memory cell, as illustrated by the sketch below. The Maxeler MAX2 FPGA card used here has two Xilinx Virtex-5 FPGAs. The Virtex-5 family provides up to 330,000 logic cells (LCs).
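To make the LUT concept concrete, the following minimal C++ model represents an N-input LUT as a 2^N-entry truth table: the table plays the role of the memory cells, and indexing by the input bits plays the role of the multiplexer tree that selects one cell. This is an illustrative software sketch, not Maxeler or Xilinx code.

```cpp
#include <bitset>
#include <iostream>

// Minimal software model of an N-input LUT: the 2^N-entry truth table plays
// the role of the FPGA memory cells, and indexing by the input bits plays
// the role of the multiplexer tree that selects one cell.
template <unsigned N>
struct Lut {
    std::bitset<(1u << N)> table; // one stored bit per input combination

    bool eval(unsigned inputs) const {
        return table[inputs & ((1u << N) - 1u)]; // select the addressed cell
    }
};

int main() {
    Lut<2> xorLut;
    xorLut.table = 0b0110; // configure as XOR: inputs 00,01,10,11 -> 0,1,1,0
    for (unsigned i = 0; i < 4; ++i)
        std::cout << i << " -> " << xorLut.eval(i) << '\n';
}
```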
As the MAX2 card has two Virtex-5 FPGAs, this sums to 660,000 LCs. Taking into account the ten times lower frequency of the dataflow hardware compared to the control-flow processor, this corresponds to 66,000 LCs in terms of calculations per second at the control-flow processor frequency. Assuming that around one hundred LCs on average are needed for implementing a 64-bit arithmetic instruction, this corresponds to 660 operations per control-flow clock cycle.
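In summary:

\[
\frac{660{,}000\ \text{LCs}}{10} = 66{,}000\ \text{frequency-normalized LCs},
\qquad
\frac{66{,}000}{100\ \text{LCs per op}} = 660\ \text{ops per control-flow cycle}.
\]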
The acceleration that can be expected from this dataflow hardware is above one hundred; this holds when the data cannot fit into the cache memory of the control-flow processor. For matrix dimensions of 320 × 200, the data can fit within the cache memory. As a result, the acceleration factor was higher when the data was not already loaded into the cache memory of the control-flow processor; once the data is in the cache memory, the achieved speed-up factor starts dropping with an increasing number of iterations. It should be noted that the acceleration factors from the graph were calculated without taking into account loading the data onto the dataflow card, so that only calculation capabilities are compared.
Dividing 660 by 100 reveals the gap between the speedup achieved on the real dataflow hardware and the theoretical maximum. This factor of 6.6 relates to instruction execution constraints. As already noted, around half of the logic cells remain unused for more than 99% of the time, since one kernel is responsible solely for edge elements; accounting for this factor of two leaves the possibility of accelerating the algorithm further by a factor of 3.3. This remaining gap between the theoretical maximum calculated here and the acceleration actually achieved can be explained by multiple factors. Data dependencies constrain the maximal frequency at which the dataflow hardware can operate, so the frequency has to be reduced for certain kernels. The implementation of the algorithm is not unique; for example, the compiler can decide whether loop unrolling will lead to better performance, which can drastically increase the number of instructions but still improve acceleration. Finally, the framework for mapping the MaxJ kernel to the logic cells is not perfect either.
It is important to note that the results of the analysis are based on the problem size with which the profiling was run. This has the benefit of estimating the acceleration directly for the problem for which the dataflow implementation needs to be programmed. On the other hand, the estimated speedup might not correlate with the one that would be measured for other problem sizes. Since profiling can be repeated for various problem sizes, this alone might be enough of a benefit for dataflow software programmers, as it lets them estimate how the acceleration factor depends on the problem size, and whether the programming effort is worthwhile.
Another benefit of the proposed approach to estimating the speedup of a dataflow implementation is that the profiling can be done before the whole algorithm is executed. This possibility has already been studied [46–48].
From the user's point of view, one has to manually estimate which portion of the algorithm can be accelerated using the dataflow hardware and mark it using the methods start_simulation() and end_simulation() of the Statistics class; a minimal usage sketch follows. The user is also responsible for defining dependencies that cannot be determined automatically. Since the profiling can be done in an early execution stage of the control-flow algorithm, the profiler can also be utilized for determining which portions of the control-flow algorithm should be moved to the dataflow hardware, and what the acceleration factor would be for translating various portions of the control-flow algorithm onto the dataflow hardware.
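The following C++ sketch marks an LBM-style time loop for profiling. Only the class name Statistics and the method names start_simulation() and end_simulation() come from the tool described here; treating them as static members is an assumption, and the stubbed types and kernel steps are illustrative placeholders.

```cpp
#include <iostream>

// Stub of the profiler interface described in the text. Only the class name
// Statistics and the method names start_simulation()/end_simulation() come
// from the paper; everything else below is an illustrative placeholder.
struct Statistics {
    static void start_simulation() { /* mark start of the region to profile */ }
    static void end_simulation()   { /* mark end; profiler reports estimates */ }
};

struct Grid { /* matrix of lattice cells; layout is illustrative */ };
void stream(Grid&)  { /* placeholder LBM streaming step */ }
void collide(Grid&) { /* placeholder LBM collision step */ }

int main() {
    Grid grid;
    Statistics::start_simulation(); // begin profiling the dataflow candidate
    for (int it = 0; it < 1000; ++it) { // the accelerable time loop
        stream(grid);
        collide(grid);
    }
    Statistics::end_simulation();   // stop profiling this region
    std::cout << "profiled region finished\n";
}
```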
On the validity of results
The main threat to validity appears to be the fact that the experimentally estimated parameters for some dataflow algorithms cannot be applied to every dataflow algorithm. The number of resources that will be utilized can be determined only once the dataflow hardware has been configured, or at least simulated. In this work, profiling C++ programs is presented, and the output of the profiler does not include the exact amount of dataflow hardware resources that will be utilized. However, based on existing implementations of dataflow algorithms and the number of different types of instructions that the dataflow implementation of the algorithm should execute, it is possible to predict the requirements for specific types of dataflow hardware resources. The dataflow parallelism potential depends on the portion of occupied dataflow hardware resources. In this work, the estimated amounts of dataflow resources are increased before calculating the parallelism, for two reasons: to keep the estimate conservative, and to take into account the control logic necessary for running instructions in parallel. A hypothetical numeric sketch of this step follows.
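The sketch below illustrates the inflation step under stated assumptions: the per-operation LC cost and the inflation factor are assumed stand-ins, not values produced by the profiler; the LC budget and instruction count reuse the figures discussed above.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative back-of-the-envelope estimate of dataflow parallelism from
// profiled instruction counts. All constants are hypothetical stand-ins.
int main() {
    const std::uint64_t availableLCs    = 660000; // e.g., two Virtex-5 FPGAs
    const std::uint64_t lcsPerOperation = 100;    // assumed avg. LCs per 64-bit op
    const std::uint64_t opsPerElement   = 124;    // instructions per matrix element
                                                  // (from the LBM profile)
    const double inflation = 1.5; // inflate the estimate: conservatism plus
                                  // control logic for parallel execution

    const double estimatedLCs = opsPerElement * lcsPerOperation * inflation;
    const double parallelism  = availableLCs / estimatedLCs; // pipeline copies
    std::cout << "estimated parallel pipeline copies: " << parallelism << '\n';
}
```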
Estimating the dataflow hardware frequency before configuring the dataflow hardware is prone to errors. However, based on the instructions that have to be executed and the dependencies between input and output data, the timing constraints can be predicted, as is done in the Maxeler framework for simulating and configuring dataflow hardware. The clock period of the dataflow hardware is set to the minimal value at which each instruction starts processing only after all required inputs are provided.
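This is the standard static-timing relation: if T_crit denotes the longest delay from an instruction's inputs to its output along any path (symbol introduced here only for this summary), the clock period T must satisfy T ≥ T_crit, so the highest admissible frequency is

\[
f_{\max} = \frac{1}{T_{\text{crit}}}.
\]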
Conclusions
A portion of an algorithm suitable for dataflow computer architectures can be executed on both control-flow and dataflow architectures. However, the dataflow hardware can accelerate the execution of an algorithm only under certain conditions. The problem size has to be large enough to amortize the disadvantages of the dataflow hardware, and the instructions that are continuously executed should be mutually independent, meaning that there should be enough calculations to perform in parallel. Programming dataflow hardware is known to require more effort than programming control-flow hardware; usually, dataflow hardware is programmed by modifying or re-implementing control-flow algorithms. It is therefore beneficial if a programmer knows in advance the expected acceleration of the algorithm on the dataflow hardware.
The presented approach to estimating the performance benefits of converting a control-flow algorithm into a dataflow algorithm requires little more than defining, in a high-level language, what the dataflow hardware should do. For this purpose, a new profiling tool is introduced that enables modifying the C++ source code so that the performance of the dataflow hardware implementation of the algorithm can be estimated.
Acknowledgements
The authors are thankful to Filip Živković for discussions on related topics.
Author contributions
All authors worked on publishing this article.
Funding
Not applicable.
Availability of data and materials
No datasets were generated or analysed during the current study.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Abbreviations
ALU: Arithmetic logic unit
CPU: Central processing unit
FPGA: Field-programmable gate array
GPU: Graphics processing unit
HPC: High-performance computing
IDE: Integrated development environment
LC: Logic cell
LUT: Look-up table
PC: Program counter
PCIe: Peripheral component interconnect express
PE: Processing element
VHDL: VHSIC hardware description language
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Milutinović, V; Salom, J; Trifunovic, N; Giorgi, R. The dataflow paradigm. Guide to dataflow supercomputing; 2015; Cham, Springer: pp. 1-39.
2. Hurson, AR; Krishna, MK. Dataflow computers: their history and future. Wiley Encyclopedia of Computer Science and Engineering; 2007.
3. Arvind, C. Dataflow architectures. Ann Rev Comput Sci; 1986; [DOI: https://dx.doi.org/10.1146/annurev.cs.01.060186.001301]
4. Veen, AH. Dataflow machine architecture. ACM Comput Surv; 1986; 18,
5. Lee, B; Hurson, AR. Issues in dataflow computing. Adv Comput; 1993; 37, pp. 285-333.
6. Lee, B; Hurson, AR. Dataflow architectures and multithreading. Computer; 1994; 8, pp. 27-39.
7. Godfrey, MD; Hendry, DF. The computer as von Neumann planned it. IEEE Ann Hist Comput; 1993; 15,
8. Milenkovic, A; Milutinovic, V. Cache injection: a novel technique for tolerating memory latency in bus-based SMPs. European conference on parallel processing; 2000; Springer, Berlin:
9. Grujic, A; Tomasevic, M; Milutinovic, V. A simulation study of hardware-oriented DSM approaches. IEEE Parallel Distrib Technol; 1996; 4,
10. Kwak, H; Lee, B; Hurson, AR; Yoon, SH; Hahn, WJ. Effects of multithreading on cache performance. IEEE Trans Comput; 1999; 48,
11. Milutinovic, D; Milutinovic, V; Soucek, B. The honeycomb architecture. Computer; 1987; 20, pp. 81-83.
12. Flynn, MJ; Mencer, O; Milutinovic, V; Rakocevic, G; Stenstrom, P; Trobec, R; Valero, M. Moving from petaflops to petadata. Commun ACM; 2013; 56,
13. Pell, O; Mencer, O; Tsoi, KH; Luk, W. Maximum performance computing with dataflow engines. In: Vanderbauwhede, W; Benkrid, K (eds). High-performance computing using FPGAs; 2013; Cham, Springer:
14. Johnston, WM; Hanna, JRP; Millar, RJ. Advances in dataflow programming languages. ACM Comput Surv; 2004; 36,
15. Chen, S; Doolen, GD. Lattice Boltzmann method for fluid flows. Annu Rev Fluid Mech; 1998; 30, pp. 329-364.
16. Trifunovic, N; Perovic, B; Trifunovic, P; Babovic, Z; Hurson, AR. A novel infrastructure for synergistic dataflow research, development, education, and deployment: the Maxeler AppGallery project. Advances in computers; 2017; Amsterdam, Elsevier: pp. 167-213.
17. S. Stojanović, D. Bojić, and V. Milutinović, Solving Gross Pitaevskii equation using dataflow paradigm, The IPSI BgD Transactions on Internet Research; 2013. pp. 1–17.
18. Korolija, N; Popović, J; Cvetanović, M; Bojović, M. Dataflow-based parallelization of control-flow algorithms. Advances in computers; 2017; Amsterdam, Elsevier: pp. 73-124.
19. Kos, A; Rankovic, V; Tomazic, S. Sorting networks on Maxeler dataflow supercomputing systems. Adv Comput; 2015; 96, pp. 139-186.
20. Bezanic, N; Popovic-Bozovic, J; Milutinovic, V; Popovic, I. Implementation of the RSA algorithm on a dataflow architecture. Trans Int Res; 2013; 9,
21. Stanojevic, I; Senk, V; Milutinovic, V. Application of Maxeler dataflow super-computing to spherical code design. Trans Int Res; 2013; 9,
22. V. Milutinović, N. Trifunović, N. Korolija, J. Popović, and D. Bojić, Accelerating program execution using hybrid control flow and dataflow architectures, In Telecommunication Forum (Telfor), IEEE; 2017. pp. 1–4.
23. D. Miladinović, M. Bojović, V. Jelisavčić, and N. Korolija, Hybrid Manycore Dataflow Processor, in Proceedings of the IX International Conference IcETRAN, Novi Pazar, Serbia; 2022. pp 6–9.
24. V. Milutinović, E. S. Azer, K. Yoshimoto, G. Klimeck, M. Djordjevic, M. Kotlar, … & I. Ratkovic, The ultimate dataflow for ultimate supercomputers-on-a-chip, for scientific computing, geophysics, complex mathematics, and information processing, In 2021 10th Mediterranean Conference on Embedded Computing (MECO), IEEE, pp. 1–6.
25. V. Milutinović, M. Kotlar, I. Ratković, N. Korolija, M. Djordjevic, K. Yoshimoto, and M. Valero, The ultimate data flow for ultimate super computers-on-a-chip, In Handbook of Research on Methodologies and Applications of Supercomputing, IGI Global; 2021. pp. 312–318.
26. Blagojević, V; Dragan, D; Bojović, M; Cvetanović, M; Đorđević, J; Đurđević, Đ; Furlan, B et al. A systematic approach to generation of new ideas for PhD research in computing. Advances in computers; 2017; Amsterdam, Elsevier: pp. 1-31.
27. Popovic, J; Bojic, D; Korolija, N. Analysis of task effort estimation accuracy based on use case point size. IET Software; 2015; 9,
28. Huang, K; Liu, Y; Korolija, N; Carulli, JM; Makris, Y. Recycled IC detection based on statistical methods. IEEE Trans Comput Aided Des Integr Circuits Syst; 2015; 34,
29. Ball, T; Larus, JR. Optimally profiling and tracing programs. ACM Trans Program Lang Syst; 1994; 16,
30. Spoonhower, D; Blelloch, GE; Harper, R; Gibbons, PB. Space profiling for parallel functional programs. ACM Sigplan Notices; 2008; 43,
31. T. Ball, and J. R. Larus, Efficient path profiling, In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, IEEE, December 1996, pp. 46–57.
32. Graham, SL; Kessler, PB; McKusick, MK. Gprof: A call graph execution profiler. ACM Sigplan Notices; 1982; 17,
33. Herodotou, H; Babu, S. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc VLDB Endow; 2011; 4,
34. F. Gruber, M. Selva, D. Sampaio, C. Guillon, A. Moynault, L. N. Pouchet, and F. Rastello, Data-flow/dependence profiling for structured transformations, In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming; 2019, pp. 173–185.
35. Korolija, N; Djukic, T; Milutinovic, V; Filipovic, N. Accelerating Lattice-Boltzman method using Maxeler dataflow approach. IPSI BgD Trans Int Res; 2013; 9,
36. Kirk, D. NVIDIA CUDA software and GPU parallel computing architecture. ISMM; 2007; 7, pp. 103-104.
37. R. P. Feynman, Lectures on Computation, The ACM Digital Library, 1998.
38. Olukotun, K; Hammond, L. The Future of Microprocessors. ACM Queue; 2005; 3,
39. Esmaeilzadeh, H; Blem, E; Amant, R; Sankaralingam, K; Burger, D. Power challenges may end the multicore era. Commun ACM; 2013; 56,
40. Yang, C; Wu, H; Huang, Q; Li, Z; Li, J. Using spatial principles to optimize distributed computing for enabling the physical science discoveries. Proc Natl Acad Sci; 2011; 108, 1.
41. Pellerin, D; Tibault, S. Practical FPGA Programming in C; 2005; USA, Prentice Hall Press:
42. D. Cohen, Mathematical Approach to Iterative Computation Networks, Proceedings of Fourth Symposium on Computer Arithmetic, IEEE; 1978, pp. 226–238.
43. Johnsson, L; Cohen, D. Mathematical approach to modeling the flow of data and control in computational networks. In: VLSI systems and computations; 1981; Cham, Springer:
44. Weiser, U; Davis, A. A wavefront notation tool for VLSI array design. In: VLSI systems and computations; 1981; Cham, Springer:
45. Fortes, JAB; Moldovan, DI. Parallelism detection and transformation techniques useful for VLSI algorithms. J Parallel Distrib Comput; 1985; [DOI: https://dx.doi.org/10.1016/0743-7315(85)90029-2]
46. Milutinovic, V; Salom, J; Veljovic, D; Korolija, N; Markovic, D; Petrovic, L. Transforming applications from the control flow to the dataflow paradigm. Dataflow supercomputing essentials; 2017; Cham, Springer:
47. Korolija, N; Popović, J; Bojović, M. Introduction to dataflow computing. Handbook of research on methodologies and applications of supercomputing; 2021; Pennsylvania, IGI Global:
48. Korolija, N; Trobec, R. An overview of selected dataflow applications in physics simulations. Exploring the dataflow supercomputing paradigm; 2019; Cham, Springer: