Content area
Issue Title: Special Issue on On-chip Parallel and Network-Based Systems
NoC technology is composed of packet-based interconnections, where the communication resources are distributed across the network. Therefore, the optimal resource utilization is a crucial consideration for efficient architectural designs. This paper studies the practicality of the Constraint Programming (CP) models for NoC architecture designs that effectively use a regular mesh with wormhole switching and the XY routing. The complexity of the CP models is compared with the earlier Mixed Integer Programming (MIP) models. Practical CP-based mapping and scheduling models are developed and results are reported on the benchmark datasets. Results indicate that mapping and scheduling problems can be solved at near optimality even under relatively shorter run-time limits as compared to those required by the MIP models.
http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = Computing (2015) 97:579592
DOI 10.1007/s00607-013-0359-4
Received: 15 April 2013 / Accepted: 30 September 2013 / Published online: 11 October 2013 Springer-Verlag Wien 2013
Abstract NoC technology is composed of packet-based interconnections, where the communication resources are distributed across the network. Therefore, the optimal resource utilization is a crucial consideration for efcient architectural designs. This paper studies the practicality of the Constraint Programming (CP) models for NoC architecture designs that effectively use a regular mesh with wormhole switching and the XY routing. The complexity of the CP models is compared with the earlier Mixed Integer Programming (MIP) models. Practical CP-based mapping and scheduling models are developed and results are reported on the benchmark datasets. Results indicate that mapping and scheduling problems can be solved at near optimality even under relatively shorter run-time limits as compared to those required by the MIP models.
Keywords Constraint programming Application mapping Scheduling
Network-on-chip
Mathematics Subject Classication 90C27 47N10 68M20 47N70
A. Demiriz (B)
Sakarya University, Sakarya 54187, Turkey e-mail: [email protected]
N. Bagherzadeh
Center for Pervasive Communications and Computing (CPCC), University of California Irvine, Irvine, CA 92697, USA e-mail: [email protected]
A. Alhussein
King Abdulaziz City for Science and Technology, Riyadh 92697, Saudi Arabia e-mail: [email protected]
http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = http://crossmark.crossref.org/dialog/?doi=10.1007/s00607-013-0359-4&domain=pdf
Web End = Using constraint programming for the design of network-on-chip architectures
Ayhan Demiriz Nader Bagherzadeh
Abdulaziz Alhussein
123
580 A. Demiriz et al.
1 Introduction
High-performance system-on-chip (SoC) and chip multi-processors entail an increase in the number of required processor elements (PEs). However, the communication performance between these PEs is degraded due to the bus-system limitations. Shared buses become bottlenecks for system communications as the number of PEs increases. Network-on-chip (NoC) is a new technology developed to overcome limitations of the classical bus and provide a scalable design for 100s of cores [3]. NoC technology is based on the packet switched interconnections, where the communication resources are distributed. PEs communicate and exchange data via packets using these shared resources (links, buffers, switches, etc.). Each PE is attached to a router that connect the PE to the network. Packets are transferred among routers according to such routing algorithms where the packet ow and directions are dened. PEs are placed and organized in a topology which has certain characteristic such as wires length, average number of communication hop and degree of ports. Network topology includes mesh, torus, star and others.
While routing algorithms and network topologies inuence the performance of NoCs, mapping applications to the NoC architecture is the key for the overall performance. Mapping tasks to the PEs and optimizing NoC trafc among them determine the actual performance metrics in terms of speed-up, power consumption, and the degree of parallelization.
Mapping facilitates assigning and placing the IP cores onto PEs such that an objective function (metric of interest) is minimized where a particular topology and corresponding task assignments are given beforehand [13]. Mapping can be considered as a sub-problem of oor-planning where nding the optimal topology is also considered as a part of the problem. However, oor-planning and mapping are practically equivalent in a regular mesh topology with identical PEs. In the rest of the paper, oor-planning and mapping are used interchangeably. Mapping is related to the quadratic assignment problem (QAP) [11,19] which can be posed as mixed-integer programming (MIP) model and was shown to be NP-hard [9]. The QAP can generally be dened as assigning facilities to the locations in order to minimize the total cost of owing goods among these locations.
This paper extends our earlier work in [4] and explores the applicability of using the constraint programming (CP) models in nding solutions to the mapping and application scheduling problems. CP plays an important modeling role as being at the intersection of traditional mathematical programming and articial intelligence (AI) [14]. The ability of utilizing intuitive and powerful optimization models that borrow modeling ideas from mathematical programming and powerful search strategies from AI makes the CP paradigm a good candidate to design efcient NoC architectures. Instead of complicated MIP models used in NoC synthesis earlier in [7,9] that do not necessarily return a solution within the run-time limits, we propose very intuitive and efcient CP models which can achieve optimal or near-optimal solutions within relatively very short run-time limits.
Contributions of this paper are the following: Application mapping and scheduling problems are successfully formulated and modeled by the CP approach.
123
Using CP for designing NoC 581
Wormhole switching is incorporated to the CP models to nd the optimal application scheduling solutions based on deterministic XY routing.
Fifteen different benchmark datasets are generated by varying the number of tasks, the number of links, and the communication volume to study the behavior of the CP models under random realizations of the task assignments.
Our approach has been tested on real application benchmark datasets provided in [10] and it has also been compared to MIP model for solving the mapping problem.
Packet latency is studied under deterministic conditions to observe its behavior by changing the task complexities.
The remainder of this paper is organized as follows. Section 2 introduces the preliminaries of NoC platform and the related literature on the mapping problems. In Sect. 3, the problem formulation and the CP-based model are given to present the advantages of our approach. Section 4 contains information on the experimental setup, further details of the CP models, and discussions of the results. Section 5 is the main extension in this paper over our work in [4]. In this section, we give results from real application benchmark datasets and we practically show that our approach is scalable for the real application data on 8 8 architecture. The paper is concluded and future directions
are highlighted in Sect. 6.
2 Preliminaries and previous work
2.1 Network-on-chip platform
PEs are organized into 2D mesh network where each PEs router has four output ports (West, East, North and South) and one internal port connect to the PE. Deterministic routing is adopted; in particular, dimension order routing (DOR) is used. Packets are divided into multiple its featuring 64-bit width, one header its and maximum of 64 body its. Section II.A of [4] contains more details about the NoC platform.
The header its require extra cycles to be transmitted. One clock cycle is consumed by the header processing unit (HPU) of the router to process the destination address. Another clock cycle is consumed by the arbiter to arbitrate between different inputs. Moreover, it transmission to the next router takes one cycle. Body its require only one clock cycle to be transmitted to the next router.
2.2 Power model
Power consumption in semiconductor technologies consists of static and dynamic power. Static power consumption becomes a critical factor in the total power as the transistors become smaller and faster. The leakage current is a signicant factor when the technology is scaled down. On the other hand, dynamic power consumption is produced by the switching activities in the semiconductor circuits.
The energy consumption analysis is presented in Section II.B of [4]. It has been shown header its consume more power because of the header processing and arbitration circuits. EH f lit and EB f lit represent the energy consumption of header and data (body) its respectively and furthermore can be expressed more by the following:
123
582 A. Demiriz et al.
EH f lit = Ebuf f er + Ecrossbar + Earbiter (1) EB f lit = Ebuf f er + Ecrossbar (2)
Dynamic power consumption can be reduced when the energy of the packet is minimized. This can be achieved by minimizing the packets communication energy between PEs. Mapping of the tasks plays an essential role of determining the packet traveling paths which reduces the number of hops. On the other hand, tasks scheduling affects the buffering requirements of packets along with the travel path.
2.3 Previous work
A branch-and-bound algorithm is proposed in [9] to nd near optimal mapping solutions that are energy aware and satisfy the bandwidth constraints under some performance constraints. The proposed mapping algorithm in [9] optimizes the energy requirements based on the random assignment of tasks to the IP cores. So the mapping problem is to place IP cores in appropriate locations on the regular grid in order to minimize energy based on the network trafc [9]. Authors in [18] propose a comprehensive two-stage NoC synthesis model by utilizing the MIP. In the rst stage, an energy efcient system-level oor-planning is achieved through MIP. The second stage is conducted for a detailed routing functionality. At stage two, placement of routers is optimized to enable the trafc ow. The MIP model is very complicated in [18] and it often does not return a solution within the run-time limits. So a clustering-based heuristic is proposed to address the complexity issue of the second stage. It should be noted that if a certain level of the problem abstraction is not applied appropriately in the MIP models, it is very likely that the MIP models will not be able to return a solution within the run-time limits due to complexity issues. From the earlier work, it is evident that the MIP implementations may suffer gravely from two shortcomings: misjudging the level of problem abstraction and the difculty in accurate modeling.
The CP paradigm is used in [2,16] on scheduling problems of multi-task application on multi-processor system-on-chip (MPSoC) after using the MIP models for nding allocation solutions. Thus, mapping and scheduling problems are effectively decomposed [2,16] into two sub-problems and solved in tandem like in our implementation. Computational efciency of the decomposition method was shown in [16] in comparison to the full optimization models. Therefore, oor-planning and application scheduling tasks are conducted in two different stages as a result of decomposition. However, we utilize the CP approach for both mapping and scheduling problems rather than using the MIP for the mapping stage and the CP for the scheduling stage in contrast with [16]. Another difference in our implementation is that data communication is modeled at packet level, therefore scheduling has been conducted in much ner detail compared to [16]. On the other hand, [16] runs mapping and scheduling successively in an iterative manner to nd a converged solution. Reference [12] which is about CP and MIP implementations of general allocation and scheduling problems provides a relevant survey paper which shows how practical to solve both problems in two stages.
123
Using CP for designing NoC 583
3 Motivation and problem formulation
3.1 Basic assumptions and overview
Assume a set of PEs, organized as a 2-D mesh of dimensions n = m m.
Since all of the PEs are identical, the architecture is homogeneous. Each PE can be labeled according to its position on the mesh (as an x and y coordinate pair) and has the capability of executing several application tasks in tandem. Because all the PEs on the mesh are identical, there is no difference in the task execution time.
The task set is represented by an annotated task graph (TG), e.g. the one shown in Figure 2 of [4] which depicts a sample task graph generated by TGFF [5]. Each node in the graph represents a task of the application with its execution time given in the parentheses. Communication requirements (its) between tasks are shown on the edges (links). TGFF was used in generating several benchmark task graphs that have been used in the literature before (see [6,18]).
The cost of transferring data from one PE to another is a function of the routing algorithm. Deterministic routing algorithms can easily be incorporated to the optimization problem on hand and can help in nding the optimal application task schedule. The transfer cost based on the XY routing algorithm is proportional to the Manhattan distance which can be calculated between points a and b on a grid as: TCab = |xa xb| + |ya yb|.
The buffer size is assumed to be unconstrained. Any communication link can only be occupied by a single packet at any given time without any constraint on the bandwidth size. The communication links are bi-directional. In other words, any particular link between two routers can be considered as a separate link (resource) in each direction.
3.2 Problem formulation
The basic mapping problem is an instance of the QAP and can be formulated as a mathematical program model given in Eq. 3 where the decision variable xi j, i, j =
1, . . . , n is a Boolean variable, f and d are ow and distance matrices respectively. In general, QAP instances that have size of n > 30 cannot be solved in a reasonable time [11].
minimize
xX
(3)
n
[summationdisplay]
i, j,k,l=1
fikdjl xi j xkl
n
[summationdisplay]
j=1
xi j = 1, i = 1, . . . , n,
s.t.
n
[summationdisplay]
i=1
xi j = 1, j = 1, . . . , n,
xi j {0, 1}, i, j = 1, . . . , n.
123
584 A. Demiriz et al.
where the decision variable, xi j simply determines whether goods (or information in our context) should be sent from node i (P Ei) to node j (P E j ). The parameter d is the Manhattan distance between all the possible pair combinations of the PEs on the mesh. Parameter f represents the information to be sent between each PE. They are both constants throughout the optimization.
The objective function in Eq. 3 is a representation of the dynamic power and it can be simplied by introducing a permutation variable, , as below [11]:
minimize
where i = j, if xi j = 1 for i, j = 1, . . . , n. Introducing the permutation variable
enables us to use the CP by invoking alldifferent constraint which species the values assigned to the decision variables that must be pairwise distinct [8] on the permutation variable, as in Eq. 4. In other words, each of the n variables should be assigned a unique value between 1 and n.
minimize
As given in Eq. 4, the QAP can be modeled in a very intuitive way by utilizing the CP approach. Instead of n2 Boolean decision variables and 2n constraints besides 0-1 integrality constraints in Eq. 3, there are only n decision (permutation) variables and a single alldifferent constraint besides the integrality constraints. Aside from the simplicity of the model representation, the central strength of the CP approach is to construct efcient propagation search techniques to detect the dead-ends on the search tree as early as possible and prune them [8]. This approach depends on the efcient ltering of the domain which is dened as a nite set of elements that can be assigned to a decision variable. Efcient arc consistency algorithms exist to nd solutions that satisfy the alldifferent constraint. In general, alldifferent can be checked for consistency, i.e. to be determined to have a feasible solution, in O(zn) time [8] where n is the number of decision variables and z is the sum of cardinalities of each domain that belongs to a decision variable. Considering that z = n2 in our problem, the
complexity of consistency checking of alldifferent constraint is then O(n5).
Once the solution to the oor-planning problem determines the position of the PEs on the mesh, the application task scheduling problem can be posed as a separate mathematical programming model. Recall that all the PEs are identical and the processing times for a given task are equal for all the PEs. Hence, the optimal solution found in Stage I can be considered as the best oor-planning assignment that does not consider the bandwidth constraints. In other words, it is a relaxed solution without the bandwidth constraints. The application scheduling problem in Stage II becomes challenging when the bandwidth constraints are introduced. The new scheduling problem
n
[summationdisplay]
i, j=1
fi jdij
n
[summationdisplay]
i, j=1
fi jdij
(4)
s.t. alldifferent(1, . . . , n),
i {1, . . . , n}, i = 1, . . . , n.
123
Using CP for designing NoC 585
can be posed as a constraint programming model by letting appropriate decision variables to represent start time, end time, and processing time of tasks. The advantages of using CP in scheduling are twofold [1]:
Natural and exible way of modeling the scheduling problems by the CP. Efcient temporal and resource constraints.
For example, precedence constraints can be represented by endBeforeStart, endBeforeEnd, endAtStart, and endAtEnd constraints easily. An endBefore Start constraint requires a job to end before the other job starts. In other words, given tasks i and j, endBeforeStart constraint means end(i) start( j).
Similarly, an endAtStart constraint requires a job to end at the start of the other job. A noOverlap constraint can be used to schedule the tasks that utilize certain resources. Thus, precedence constraints can be constructed from the task graph. Communication links between routers can be considered as resources to model bandwidth requirements. We can then easily use noOverlap constraint to nd a schedule of transmission tasks through a particular communication link without violating precedence and resource availability constraints.
4 Experimental study
This section illustrates the applicability of our constraint programming based approach on randomly generated benchmark datasets.1 Throughout the experiments, we assume that channels can hold 64-bit its, the packet size is 64 its, and a 1,000 MHz architecture utilized. Tasks execution time and packets transmission delay are measured in clock cycles and therefore clock operating frequency does not inuence the mapping and scheduling performance. The header its require extra time of three clock cycles. The CP models are implemented by using IBM ILOG OPL Studio, which is available free of charge to the academicians at IBM Academic Initiative web site.2 Interested reader may consult with Section IV of [4] for further details of experimental setup.
4.1 Implementation details of the proposed model
There are practically two separate CP models, one for each stage. Stage I is primarily composed of the CP model given in Eq. 4. Since 3 3 and 4 4 regular meshes
are used in our implementation, there are only nine and sixteen decision variables respectively for the CP model in stage I and the domain (D) of these variables are
{1, . . . , 9} and {1, . . . , 16} respectively. There are two main input parameters for this
stage besides the random assignment of the tasks to PEs: namely ow ( f ) and distance (d) matrices. The distance matrix, d, represents the transfer cost T C for the XY-routing,i.e. Manhattan distances between the nodes on meshes which are 33 and 44 in our
implementation. The ow matrix, f , expresses the total communication requirements
1 All the datasets and optimization models can be downloaded at http://tinyurl.com/cdq5l9n
Web End =http://tinyurl.com/cdq5l9n .
2 http://tinyurl.com/cu5txlg
Web End =http://tinyurl.com/cu5txlg .
123
586 A. Demiriz et al.
(data its) between each task given on the edge of task graphs (see Figure 2 of [4]) with additional header it overhead for each data packet transferred.
The mapping problem is poorly-dened without randomly assigning tasks to the PEs rst. Otherwise, it is always feasible to assign all of the tasks to a single PE which will result in 0 energy consumption i.e. null solution for all the data communications of tasks. The random assignment is devised in such a way to guarantee the assignment of at least one task to each PE. However, each PE is expected to serve for multiple tasks as equally likely.
After producing an assignment solution at Stage I by solving CP model given in Eq. 4, we can now schedule all the tasks and their corresponding communication tasks as well. The second stage is more complicated than the rst one in terms of the modeling as the detailed implementation of the routing algorithm is required. The objective of Stage II can be set to minimizing the maximum completion time (make-span) among all the tasks. To enable an easier scheduling implementation, one can consider communication tasks among PEs at a very detailed level to generate precedence constraints appropriately. The reason for this is primarily the implementation of wormhole switching algorithm with XY routing.
We propose the creation of a data structure at it type level given in Figure 3 of [4] to manage this implementation. Benets and implementations of this data structure can be found in Section IV. B of [4]. By carefully crafting the data structure given in Figure 3 of [4], the precedence constraints which are used to implement the wormhole switching and the XY routing can be easily devised in order to nd a solution to the application task scheduling. As mentioned in Section 3 endBeforeStart, endBeforeEnd, endAtStart and endAtEnd constraints can accordingly be used in Stage II CP model. Notice that the physical links between routers should be treated as resources in the application task scheduling problem. When we consider these links as resources, it is easy to implement the bandwidth constraints by noOverlap constraints. Notice that we do not have a bandwidth size limitation in our implementation, however, our formulation implicitly constrains the bandwidth size as equal to the packet size. Therefore, only one packet can be served at any given time for a particular link (physical link) on the network. All the constraint types used in this study including alldifferentcan be implemented in any modern CP language without any difculty. IBM ILOG solver, our choice of solver, can handle such constraints efciently as being the best commercial solver available.
The run-time limits for Stage I and Stage II CP Models are both set to 1000 seconds total CPU time in our experiments. Considering that four processors are set to be used by the IBM ILOG solver, CP models return optimal or the best solution within 250 s at each stage. This is the way CPU time is controlled by solver, therefore, there is no need to report the actual run times.
4.2 Results
We conducted the experiments for fty random realizations of the instances that are summarized in Table 1 of [4]. A random realization is essentially a set of random assignments of the tasks to the PEs. All of the experiments ran successfully
123
Using CP for designing NoC 587
and returned solutions without any failure. Compared to very limited the number of tasks and the number of links used in the experimental studies previously (e.g. around 30 in [2], at most 24 in [6]), the task graphs generated for our experiments include graphs with 77 tasks and 110 links at most. It should be noted that even the benchmark datasets used in the previous work such as [6,18] were generated by TGFF [5].
We observed the objective values of Stage I and Stage II CP models, these are summarized by the box-plots depicted in Figure 4 of [4] and Figure 5 of [4] respectively for the 3 3 mesh architecture. The range of the distribution can be easily visualized
by a box-plot. Outliers are presented at the tail or the head of the box-plot as circles. As the complexity of the underlying problems increases, the objective values of the CP models increase as well, since they are minimization problems.
Power models given in Sect. 2 indicate relationship between energy consumption and the number of hops that packets that traverse between PEs. Therefore, it is assumed that the actual power usage will be a function of the current objective function of Stage I CP model which is the sum of information ow multiplied by the distance between PEs. Therefore, our formulation successfully addresses the issue of energy-aware design of NoC architecture. The completion time of all tasks in a task graph is used as the objective function in Stage II CP model i.e. makespan. Some other objectives such as task (job) tardiness or lateness can be used as well. But it is still safe to use universally the latest task completion time as the objective function to minimize in scheduling problems. Notice that, none of the scheduling problem cases was found to be infeasible, all of them returned the best solution.
The distributions of latency at packet level for each instance are depicted in Figure 6 of [4] for 3 3 mesh architecture. Due to the network congestions by increasing com
plexity of CTGs, it is expected to have higher latencies for the packet transmissions. We can conclude from Figure 6 of [4] that the variation of the packet latency increases as well by increasing the complexity of the CTG with respect to the congestion of the network.
Figures 7, 8, 9 of [4] summarize the results from the 44 mesh architecture. When
we compare Figures 5 and 8 of [4], we can see the benets of the extra resources (i.e. PEs) effects on the clock time. As expected the 4 4 mesh architecture completes
the tasks earlier than the 3 3 mesh architecture. On the other hand, the application
scheduling task becomes more complex by increasing the mesh size due to the extra decision variables and the resources in the model as seen in Table 1 of [4]. However, the CP models were able to return the results within the same run-time limits as the experiments for the 3 3 mesh architecture.
5 Application on real benchmark datasets
In addition to the tasks data sets that were synthetically generated by a random problem generator TGFF, we have employed real application benchmark datasets to evaluate the mapping and the scheduling algorithm. Multi-constraint system-level (MCSL) benchmark suite [10] provides a set of real applications where each application composes multiple tasks and trafc data patterns between these tasks. MCSL benchmark
123
588 A. Demiriz et al.
Table 1 MCSL benchmarksuite applications Application Number of tasks Number of
comm. links
R-S code encoder 248 328
R-S code decoder 278 390
ROBOT 88 131
SPEC95 FPPPP 334 1,145
SPARSE 96 67
H.264 video decoder 2,311 3,461
Table 2 MCSL 44 mesh results (Average Std. dev.)Stage I Stage II
Application Objective Num. of sol. Objective Num. of sol. Run time (s) Latency
R-S32ENC 1,3240 70 182118.73 1.60.5 0.710.26 50.01 R-S32DEC 2,2080 90 2,961151 3.81.8 1,6000 12.370.45
ROBOT 10,169120 195 92,8181,914 10 1.590.35 7.910.07 FPPPP 161,021914 15.73.68 85,3715,914 1.60.7 385443 149.46.8
SPARSE 22,206365 19.55 19,9821,101 1.20.4 4.32.24 14.493.1H.264 1,501,5053,007 192.36 19,532,265727,341 10 23,335597 63290
records the data trafc for different mesh network sizes and measures the execution time for each task in the application.
Table 1 shows the applications provided by MCSL which were used as data sets of our mapping and scheduling algorithms. Table 1 show also the number of task of each applications as well as the number of communication edges.
The les obtained by MCSL benchmark includes tasks execution times, the details of communication links between task, and the amount of data on each communication link. The execution time is represented by clock cycles while the data is represented by number of words on each link. Words are 32-bit wide which corresponds to one it. Each packet contains one header it and eight data body its. Between two network nodes, header its require three clock cycles while data it require only one clock cycle.
For each different application in MCSL Benchmark, we generated 10 different random realizations of the execution times and the trafc patterns according to distributional parameters provided in the benchmark data specically the les with STP extension [10]. Two different sizes of the mesh architecture were utilized in our experiments: 44 and 88. The packet size was set to be eight for all the applications
except H.264 which was set to be 64 due to the computational complexity caused by the small packet size in Stage II. The CP models for both Stage I and Stage II used for experiments in Sect. 4 were utilized without any change.3
Table 2 presents the results from experiments on 44 mesh architecture. All the
results were averaged over ten different random realizations and standard deviations
3 All the related model and data les can be accessed at http://tinyurl.com/cjseuuz
Web End =http://tinyurl.com/cjseuuz .
123
Using CP for designing NoC 589
Table 3 MCSL 88 mesh results (Average Std. dev.)
Stage I Stage II
Application Objective Num. of sol. Objective Num. of sol. Run time (s) Latency
R-S32ENC 1,380 0 41 0 1,752 18.64 2.6 0.5 1.9 0.82 5 0.03 R-S32DEC 4,020 0 14 0 2,959 150 12.6 1.7 3,200 0 17.38 0.82
ROBOT 9,808 299 50 9 92,452 1,871 1 0 1.43 0.25 7.80 0.22 FPPPP 304,282 3,506 22.8 6.66 85,317 5,913 1.3 0.7 499 502 196.4 9.4
SPARSE 19,125 495 58.1 8.8 20,012 1,101 1.2 0.4 7.1 3.6 17.73 5.7
were also reported after the operator. The total CPU time limit was set to be 1,000 s
in all the experiments for the application mapping model (i.e. Stage I) except H.264 which was set to 3,600 s. We set the time limit to 1,600 s in Stage II (i.e. the application scheduling model). We reported the objective values and number of different feasible solutions in both Stage I and Stage II CP models. None of Stage I models ran to the optimality. Therefore reported results are based on the best solutions found until the run time limit. However, all Stage II models produced the optimal scheduling except for the R-S32DEC and H.264 applications. We also reported CPU times for Stage II which differ from the run time limits as most scheduling problems were solved to the optimality within seconds. The most challenging problem was the application scheduling part of the H.264 benchmark. To give an idea about the complexity, a representative problem might have over 67 k variables and over 79 k constraints for the H.264 benchmark data in Stage II. Table 2 also reports the latencies in the unit of clock cycles.
Similarly, Table 3 presents results from experiments on the 88 mesh architecture.
However, the results from H.264 were omitted as the second stage becomes too complex and results in memory problems because there are around 137 k variables and 178 k constraints. The CPU time limits were set to twice as much as the ones in the 44 experiments. Since the CPU time limits were higher, the CP solver was able to
nd more feasible solutions in the rst stage (application mapping). However, the CP solver was not able to guarantee that Stage I results were optimal as in the case of 44
experiments.
A representative sample of the progression of the objective value at Stage I is given in Fig. 1. The x-axis represents the number of branches that are generated thus far. Basically, the CP solver was able to sift through approximately 700 k branches within CPU time limit which is 2,000 s for the 88 experiments. Each objective value
corresponds to a unique solution. At the end of the run, the CP solver reports the best solution found within the CPU time limits.
Finally, we compare the alternative modeling approaches for the mapping problem in this section; namely MIP and CP models. Table 4 reports average objective values of Stage I over 10 random realizations for both 44 and 88 mesh architectures. Our
implementation of the MIP model was adapted from [17]. To prevent the null (i.e. the zero objective value) solution, the model used in [17] incorporates a new constraint not to assign a specic core to the same position on the mesh as its label. Therefore,
123
590 A. Demiriz et al.
Progression of Objective Value at Stage I
45
Objective Value (in thousands)
40
35
30
25
20
0 100 200 300 400 500 600 700
Iteration Number (in thousands)
Fig. 1 Progression of objective value at stage I of SPARSE dataset on 8 8 architecture
Table 4 Comparing avg. objective values of MIP and CP on stage I model
Application MIP Results on44 Mesh
CP Resultson 88
Mesh
R-S code encoder 13,48 1,324 9,840 1,380
R-S code decoder 2,604 2,208 7,002 4,020
ROBOT 13,270 10,169 71,254 9,808
SPEC95 FPPPP 160,072 161,020 395,005 304,283
SPARSE 26,334 22,206 95,983 19,125
H.264 video decoder 2,663,004 1,505,505 NA NA
CP Results on 44
Mesh
MIP Results on 88 Mesh
Prog. of Obj. Values in Stage I - MIP Model
70
Objective Value (in thousands)
60
50
40
30
20
10
0
0 10 20 30 40 50 60
Number of Nodes (in thousands)
Fig. 2 Progression of object value of stage I MIP model on SPARSE dataset on 4 4 architecture
there were some differences between objective values of MIP and CP models and the CP models yielded better results in almost all the cases except the one designated as SPEC95 FPPPP. Our MIP implementation is also provided in the experimental les
123
Using CP for designing NoC 591
available in the web page mentioned in the footnote above. Run time limits were chosen to be comparable to the experiments from the CP models.
A representative sample of progression of the MIP solution is given in Fig. 2. There were nearly 270 k nodes generated in this particular run, but the x-axis of Fig. 2 ends around 60 k for the visual clarity as the objective value does not change anymore.
6 Conclusion and future work
A CP-based two-stage model is proposed to solve the mapping and the application scheduling problems listed in [13] as outstanding research problems in application modeling and optimization for NoC architectures. The major advantage of using CP is the clarity and understandability of the models. The CP modeling is more exible than the MIP on many challenging optimization problems, including mapping and scheduling. We successfully experimented our model on various benchmark datasets generated by TGFF. Deterministic XY-routing with wormhole switching is successfully used for the task and the data communication scheduling.
The proposed approach in this paper has also been applied to real application benchmark datasets. Results show clear advantages of CP models when they are used both for mapping and scheduling problems. Application benchmark datasets have purposely been chosen to be challenging where several thousands of tasks and communication links form the trafc patterns.
A logical extension to the homogeneous platform is to study the case of having heterogeneous PEs. The challenge of dynamic voltage frequency island (VFI) is an open problem in terms of providing robust models [6,15]. Therefore, an extension of the CP model should be in the direction of studying VFI on heterogeneous platforms and adaptive routing. The work in this paper can be extended to real applications on various processing platforms beyond 88 mesh architecture in the
future.
References
1. Baptiste P, Laborie P, Pape CL, Nuijten W (2006) Handbook of constraint programming, chap 22. Constraint-based scheduling and planning. Elsevier, Amsterdam, pp 761799
2. Benini L, Bertozzi D, Guerri A, Milano M (2005) Allocation and scheduling for mpsocs via decomposition and no-good generation. In: van Beek P (ed) Principles and practice of constraint programming CP 2005. Lecture notes in computer science, vol 3709. Springer, Berlin, pp 107121. doi:http://dx.doi.org/10.1007/11564751_11
Web End =10.1007/ http://dx.doi.org/10.1007/11564751_11
Web End =11564751_11
3. De Micheli G, Seiculescu C, Murali S, Benini L, Angiolini F, Pullini A (2010) Networks on chips: from research to products. In: Sapatnekar SS (ed) DAC. ACM, pp 300305
4. Demiriz A, Bagherzadeh N, Alhussein A (2013) Cpnoc: On using constraint programming in design of network-on-chip architecture. In: Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, pp 486493. doi:http://dx.doi.org/10.1109/PDP.2013.78
Web End =10.1109/PDP.2013.78
5. Dick RP, Rhodes DL, Wolf W (1998) Tgff: task graphs for free. In: Borriello G, Jerraya AA, Lavagno L (eds) CODES, IEEE Computer Society, pp 97101
6. Ghosh P, Sen A (2010) Efcient mapping and voltage islanding technique for energy minimization in noc under design constraints. In: Shin, SY Ossowski, S Schumacher, M Palakal, MJ Hung CC (eds.) SAC. ACM, New York, pp 535541
123
592 A. Demiriz et al.
7. He O, Dong S, Jang W, Bian J, Pan DZ (2011) Unism: unied scheduling and mapping for general networks on chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):114. doi:http://dx.doi.org/10.1109/TVLSI.2011.2159280
Web End =10.1109/TVLSI.2011.2159280
8. van Hoeve WJ, Katriel I (2006) Handbook of constraint programming, chap 6 Global Constraints. Elsevier, Amsterdam, pp 169208
9. Hu J, Marculescu R (2005) Energy- and performance-aware mapping for regular noc architectures. IEEE Trans CAD Integr Circuits Syst 24(4):551562
10. Liu W, Xu J, Wu X, Ye Y, Wang X, Zhang W, Nikdast M, Wang Z (2011) A noc trafc suite based on real applications. In: VLSI (ISVLSI), 2011 IEEE Computer Society Annual Symposium on, pp 6671. doi:http://dx.doi.org/10.1109/ISVLSI.2011.49
Web End =10.1109/ISVLSI.2011.49
11. Loiola EM, de Abreu NMM, Netto POB, Hahn P, Querido TM (2007) A survey for the quadratic assignment problem. Eur J Oper Res 176(2):657690
12. Lombardi M, Milano M (2012) Optimal methods for resource allocation and scheduling: a cross-disciplinary survey. Constraints 17(1):5185. doi:http://dx.doi.org/10.1007/s10601-011-9115-6
Web End =10.1007/s10601-011-9115-6
13. Marculescu R, Ogras Y, Peh LS, Jerger NDE, Hoskote YV (2009) Outstanding research problems in noc design: system, microarchitecture, and circuit perspectives. IEEE Trans CAD Integr Circuits Syst 28(1):321
14. Michel L, Van Hentenryck P (2010) Wiley encyclopedia of operations research and management science, chap. Basic CP theory: search. Wiley, New York. doi:http://dx.doi.org/10.1002/9780470400531.eorms0088
Web End =10.1002/9780470400531.eorms0088
15. Ogras Y, Marculescu R, Marculescu D, Jung EG (2009) Design and management of voltage-frequency island partitioned networks-on-chip. IEEE Trans VLSI Syst 17(3):330341
16. Ruggiero M, Guerri A, Bertozzi D, Milano M, Benini L (2008) A fast and accurate technique for mapping parallel applications on stream-oriented mpsoc platforms with communication awareness. Int J Parallel Program 36:336. doi:http://dx.doi.org/10.1007/s10766-007-0032-7
Web End =10.1007/s10766-007-0032-7
17. Shcherbina O. Quadratic assignment test problems. http://www.mat.univie.ac.at/~oleg/TEST/Chapter13/HCP13.5.1.mod
Web End =http://www.mat.univie.ac.at/~oleg/TEST/ http://www.mat.univie.ac.at/~oleg/TEST/Chapter13/HCP13.5.1.mod
Web End =Chapter13/HCP13.5.1.mod . Accessed Apr 2013
18. Srinivasan K, Chatha KS, Konjevod G (2006) Linear-programming-based techniques for synthesis of network-on-chip architectures. IEEE Trans VLSI Syst 14(4):407420
19. Zhang H, Beltran-Royo C, Constantino M (2010) Effective formulation reductions for the quadratic assignment problem. Comput OR 37(11):20072016
123
Springer-Verlag Wien 2015