Comput Manag Sci (2010) 7:139–170
DOI 10.1007/s10287-008-0080-5
ORIGINAL PAPER
Towards a practical parallelisation of the simplex method
J. A. J. Hall
Published online: 11 October 2008
© Springer-Verlag 2008
Abstract The simplex method is frequently the most efficient method of solving linear programming (LP) problems. This paper reviews previous attempts to parallelise the simplex method in relation to efficient serial simplex techniques and the nature of practical LP problems. For the major challenge of solving general large sparse LP problems, there has been no parallelisation of the simplex method that offers significantly improved performance over a good serial implementation. However, there has been some success in developing parallel solvers for LPs that are dense or have particular structural properties. As an outcome of the review, this paper identifies scope for future work towards the goal of developing parallel implementations of the simplex method that are of practical value.
Keywords Linear programming · Simplex method · Sparse · Parallel computing
Mathematics Subject Classification (2000) 90C05
1 Introduction
Linear programming (LP) is a widely applicable technique both in its own right and as a sub-problem in the solution of other optimization problems. The simplex method and interior point methods are the two main approaches to solving LP problems. In a context where families of related LP problems have to be solved, such as integer programming and decomposition methods, and for certain classes of single LP problems, the simplex method is usually more efficient.
J. A. J. Hall (B)
School of Mathematics, University of Edinburgh, JCMB, King's Buildings, Edinburgh EH9 3JZ, UK e-mail: [email protected]
The application of parallel and vector processing to the simplex method for linear programming has been considered since the early 1970s. However, only since the beginning of the 1980s have attempts been made to develop implementations, with the period from the late 1980s to the late 1990s seeing the greatest activity. Although there have been a few experiments using vector processing and shared memory machines, the vast majority of implementations have made use of distributed memory multiprocessors and Ethernet-connected clusters.
The initial aim of this paper is to provide a comprehensive review of past approaches to exploiting parallelism in the simplex method, including an assessment of the extent to which they have yielded implementations of practical value. To facilitate this, Sect. 2 introduces the simplex method and discusses issues of implementation and computational characteristics that influence its parallelisation. A short overview of the nature of practical LP problems and suitable test problems is given in Sect. 3. Terms and concepts in parallel computing that are used in this paper are introduced briefly in Sect. 4.
It is clear from the review of past work in Sect. 5 that, in most cases, the focus of attention has been the development of techniques by which speed-up can be achieved. Only rarely has any thought been given to the underlying computational scheme in terms of serial efficiency and numerical stability. As a consequence, although many implementations have demonstrated good speed-up, and a few were worthwhile parallel challenges at the time, fewer still have been of practical value.
The second aim of the paper is to identify promising computational strategies for worthwhile future parallel implementations. This is done in Sect. 6. Although the feasibility of their implementation is considered, detailed discussion of techniques and architectures is beyond the scope of this paper.
2 The simplex method
The simplex method and its computational requirements are most conveniently discussed in the context of LP problems in standard form

    minimize    c^T x
    subject to  Ax = b
                x \ge 0,                                  (2.1)

where x \in R^n and b \in R^m. The matrix A in (2.1) usually contains columns of the identity corresponding to logical (slack) variables introduced to transform inequality constraints into equations. The remaining columns of A correspond to structural (original) variables.

In the simplex method, the indices of variables are partitioned into sets B corresponding to m basic variables x_B, and N corresponding to n - m nonbasic variables x_N, such that the basis matrix B formed from the columns of A corresponding to B is nonsingular. The set B itself is conventionally referred to as the basis. The columns of A corresponding to N form the matrix N. The components of c corresponding to B and N are referred to as, respectively, the basic costs c_B and non-basic costs c_N.

When the nonbasic variables are set to zero the values \hat{b} = B^{-1}b of the basic variables, if non-negative, correspond to a vertex of the feasible region. The expression
x_B + B^{-1}N x_N = \hat{b} derived from the equations in (2.1) allows the basic variables to be eliminated from the objective, which becomes (c_N^T - c_B^T B^{-1}N) x_N + c_B^T \hat{b}. If none of the components of the vector of reduced costs \hat{c}_N^T = c_N^T - c_B^T B^{-1}N is negative then the current basis is optimal.
In each iteration of the simplex method, if the current basis is not optimal a nonbasic variable x_q with negative reduced cost is chosen to enter the basis. Increasing this variable from zero whilst maintaining the equations in (2.1) corresponds to moving along an edge of the feasible region such that the objective function decreases. The direction of this edge is given by the column \hat{a}_q of \hat{N} = B^{-1}N corresponding to x_q. By considering the ratios of the components of \hat{b} to the corresponding components of \hat{a}_q (when positive), the simplex method finds the first basic variable to be reduced to zero as x_q is increased and, hence, the step to the next vertex of the feasible region along this edge.
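The choice of entering variable and the ratio test just described can be sketched as follows. The dense NumPy representation and the use of the most-negative-reduced-cost (Dantzig) rule are illustrative assumptions, not the paper's recommended implementation.

```python
import numpy as np

def choose_entering_and_leaving(c_hat, b_hat, N_hat):
    """One illustrative pivot choice: Dantzig rule plus the ratio test.

    c_hat : reduced costs of the nonbasic variables
    b_hat : current values of the basic variables (assumed non-negative)
    N_hat : B^{-1}N, the tableau columns of the nonbasic variables
    Returns (q, p, step) or None if the current basis is optimal.
    """
    q = int(np.argmin(c_hat))
    if c_hat[q] >= 0:
        return None                     # no negative reduced cost: optimal
    a_q = N_hat[:, q]                   # direction of the chosen edge
    mask = a_q > 0                      # only positive entries bound the step
    if not mask.any():
        raise ValueError("LP is unbounded along this edge")
    ratios = np.full_like(b_hat, np.inf, dtype=float)
    ratios[mask] = b_hat[mask] / a_q[mask]
    p = int(np.argmin(ratios))          # first basic variable driven to zero
    return q, p, ratios[p]
```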
There are many strategies for choosing the nonbasic variable x_q to enter the basis. The original rule of choosing the variable with the most negative reduced cost is commonly referred to as the Dantzig criterion. However, if the components of \hat{a}_j are large relative to \hat{c}_j it is likely that only a small increase in x_j will be possible before one of the basic variables is reduced to zero. Alternative pricing strategies weight the reduced cost by dividing it by (a measure of) the length of \hat{a}_j. The exact steepest edge strategy, described by Goldfarb and Reid (1977), maintains weights s_j = 1 + \|\hat{a}_j\|^2 corresponding to the step length for a unit change in x_j. Practical (approximate) steepest edge techniques are described by Forrest and Goldfarb (1992) and the Devex strategy due to Harris (1973) maintains approximate edge weights. Using these strategies, the number of iterations required to solve an LP problem in practice can be taken as O(m + n) and no practical problem is known for which the theoretical complexity of O(2^n) is achieved. A further consequence of edge-weight pricing strategies is a reduction in the frequency with which variables leave and enter the basis, leading to improved numerical stability.
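By way of illustration, exact steepest edge selection might be sketched as below, recomputing the weights s_j = 1 + \|\hat{a}_j\|^2 from dense tableau columns. As noted above, practical codes maintain such weights by updating them, since recomputation of this kind is expensive; the function and its dense inputs are assumptions made purely for illustration.

```python
import numpy as np

def steepest_edge_choice(c_hat, N_hat):
    """Select the entering variable by exact steepest edge pricing.

    c_hat : reduced costs of the nonbasic variables
    N_hat : dense matrix whose columns are the edge directions a_hat_j
    Each reduced cost is weighted by the length of its edge direction,
    using the weight s_j = 1 + ||a_hat_j||^2.
    """
    s = 1.0 + (N_hat ** 2).sum(axis=0)   # s_j = 1 + ||a_hat_j||^2
    merit = c_hat / np.sqrt(s)           # weighted reduced costs
    q = int(np.argmin(merit))
    return q if c_hat[q] < 0 else None   # None signals optimality
```

Note how a candidate with a long edge direction (large s_j) is penalised even if its raw reduced cost is the most negative.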
A popular technique for choosing the variable to leave the basis is the EXPAND procedure of Gill et al. (1989). By expanding the constraint bounds by a small amount, this strategy often enables the leaving variable to be chosen from a set of possibilities on grounds of numerical stability.
The two main variants of the simplex method correspond to different means of calculating the data required to determine the step to the new vertex. The first variant is the standard simplex method in which the reduced costs and the directions of all edges at the current vertex are maintained in a rectangular tableau. In the revised simplex method, the reduced costs and the direction of the chosen edge are determined by solving systems involving the basis matrix B.
2.1 The standard simplex method
In the standard simplex method the matrix \hat{N}, the (reduced) right-hand side vector \hat{b}, the reduced costs \hat{c}_N and the current value of the objective f = c_B^T \hat{b} are maintained in a tableau of the following form.

              N            RHS
    B      \hat{N}        \hat{b}
           \hat{c}_N^T       f
Each iteration of the standard simplex method requires a Gauss–Jordan elimination operation to be applied to the columns of the tableau so that the updated tableau corresponds to the new basis.
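The Gauss–Jordan step can be sketched as follows, assuming (purely for illustration) that the tableau is held as one dense array T with the cost row included as its last row, and that the pivot lies in row p and column q:

```python
import numpy as np

def gauss_jordan_update(T, p, q):
    """Apply one Gauss-Jordan elimination step to a dense simplex tableau.

    T is a dense array holding the tableau (columns of N_hat, the RHS and
    the cost row together); row p and column q identify the pivot entry.
    The update turns column q into the unit vector e_p, which is exactly
    the transformation to the tableau for the new basis.
    """
    T = T.astype(float, copy=True)
    pivot = T[p, q]
    T[p, :] /= pivot                    # scale the pivotal row
    for i in range(T.shape[0]):         # eliminate column q elsewhere
        if i != p and T[i, q] != 0.0:
            T[i, :] -= T[i, q] * T[p, :]
    return T
```

Every entry of the tableau is touched in every iteration, which is why the method costs O(mn) work per iteration on a dense data structure.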
The simplex method is commonly started from a basis for which B = I so the matrix \hat{N} in the standard simplex tableau is N. As such the tableau is sparse. It is widely assumed that the extent of fill-in caused by the elimination operations is such that it is not worth exploiting sparsity. As a consequence, the standard simplex method is generally implemented using a dense data structure.
The standard simplex method is inherently numerically unstable since it corresponds to a long sequence of elimination operations with pivots chosen by the simplex algorithm rather than on numerical grounds. If the simplex method encounters ill-conditioned basis matrices, any subsequent tableau corresponding to a well-conditioned basis can be expected to have numerical errors reflecting the earlier ill-conditioning. This can lead to choices of entering or leaving variables such that, with exact arithmetic, the objective does not decrease monotonically, feasibility is lost, or the basis matrix becomes singular. Robustness can only be achieved by monitoring errors in the tableau and, if necessary, recomputing the tableau in a numerically stable manner. Error checking can be performed by comparing the updated reduced cost with the value computed directly using the pivotal column and basic costs. Alternatively, since operations with the inverse of the basis matrix can be performed using appropriate entries of the tableau, calculating the pivotal column directly and comparing this with the tableau entries would provide a more extensive but expensive error checking mechanism.
2.2 The revised simplex method
The computational components of the revised simplex method are illustrated in Fig. 1. At the beginning of an iteration, it is assumed that the vector of reduced costs \hat{c}_N and the vector \hat{b} of current values of the basic variables are known and that a representation of B^{-1} is available. The first operation is CHUZC, which scans the (weighted) reduced costs to determine a good candidate q to enter the basis. The pivotal column \hat{a}_q is formed using the representation of B^{-1} in an operation referred to as FTRAN.

The CHUZR operation determines the variable to leave the basis, with p being used to denote the index of the row in which the leaving variable occurred, referred to as the pivotal row. The index of the leaving variable itself is denoted by p'. Once the indices q and p' have been interchanged between the sets B and N, a basis change is said to have occurred. The RHS vector \hat{b} is then updated to correspond to the increase \theta = \hat{b}_p / \hat{a}_{pq} in x_q.
Before the next iteration can be performed it is necessary to obtain the reduced costs and a representation of the new matrix B^{-1}.

Fig. 1 Operations in an iteration of the revised simplex method

Although the reduced costs may be
computed directly using the following operations:

    \pi_B^T = c_B^T B^{-1};    \hat{c}_N^T = c_N^T - \pi_B^T N,          (2.2)

it is more efficient computationally to update them by computing the pivotal row \hat{a}_p^T = e_p^T B^{-1} N of the standard simplex tableau. This is obtained in two steps. First the vector \pi_p^T = e_p^T B^{-1} is formed using the representation of B^{-1} in an operation known as BTRAN, and then the vector \hat{a}_p^T = \pi_p^T N of values in the pivotal row is formed. This sparse matrix–vector product with N is referred to as PRICE. Once the reduced costs have been updated, the UPDATE operation modifies the representation of B^{-1} according to the basis change. Note that, periodically, it will generally be either more efficient, or necessary for numerical stability, to find a new representation of B^{-1} using the INVERT operation.

When using the Devex strategy (Harris 1973) the pivotal row computed to update the reduced costs is used to update the Devex weights at no significant computational cost. To update the exact steepest edge weights requires, in addition to the pivotal row, a further BTRAN operation to form \hat{a}_q^T B^{-1} and a further PRICE operation to form the product of this vector with N. It is also computationally expensive to initialise steepest edge weights if the initial basis matrix is not the identity. As a consequence of these overheads, and since the Devex strategy performs well in terms of reducing the number of iterations required to solve LP problems, Devex is commonly the default in efficient sequential implementations of the revised simplex method.
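A minimal sketch of one revised simplex iteration follows. For clarity it uses an explicit dense inverse for B^{-1} (which Sect. 2.4 notes is impractical for large sparse problems) and recomputes the reduced costs from scratch via (2.2) rather than updating them from the pivotal row; all names and tolerances are illustrative.

```python
import numpy as np

def revised_simplex_iteration(A, b, c, basis):
    """One iteration of the revised simplex method, with an explicit
    dense inverse standing in for the factored representation of B^{-1}
    that an efficient implementation would use.  Returns the updated
    basis, or None if the current basis is optimal.
    """
    nonbasic = [j for j in range(A.shape[1]) if j not in basis]
    Binv = np.linalg.inv(A[:, basis])            # INVERT (illustrative only)
    pi = c[basis] @ Binv                         # BTRAN: pi^T = c_B^T B^{-1}
    c_hat = c[nonbasic] - pi @ A[:, nonbasic]    # PRICE: reduced costs, as in (2.2)
    if (c_hat >= -1e-9).all():                   # CHUZC: no attractive candidate
        return None
    q = nonbasic[int(np.argmin(c_hat))]          # Dantzig choice of entering variable
    a_q = Binv @ A[:, q]                         # FTRAN: pivotal column
    b_hat = Binv @ b                             # current basic values
    mask = a_q > 1e-9
    if not mask.any():
        raise ValueError("LP is unbounded")
    p = int(np.where(mask)[0][np.argmin(b_hat[mask] / a_q[mask])])  # CHUZR
    basis = list(basis)
    basis[p] = q                                 # basis change
    return basis
```

Repeated calls walk from vertex to vertex until None is returned at an optimal basis.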
2.3 The revised simplex method with multiple pricing
When in-core memory was severely restricted, a popular variant of the revised simplex method was to incorporate minor iterations of the standard simplex method, restricted to a small subset of the variables. This is described by Orchard-Hays (1968) and is referred to as multiple pricing. The major computational steps of this variant, together with those required when using Devex or steepest edge weights, are illustrated in Fig. 2. The features of these variants, in particular those incorporating Devex or steepest edge, make exploiting parallelism particularly attractive.
Fig. 2 The revised simplex method with multiple pricing

The CHUZC operation scans the (weighted) reduced costs to determine a set Q of good candidates to enter the basis. The inner loop then applies the standard simplex
method to the LP problem corresponding to the candidates in Q and, therefore, requires the corresponding columns \hat{a}_j = B^{-1} a_j, j \in Q, of the standard simplex tableau. These columns are formed as a multiple FTRAN. The matrix formed by the columns \hat{a}_j, j \in Q, and the corresponding vector of reduced costs for the candidates j \in Q are conveniently denoted by \hat{A}_Q and \hat{c}_Q.
In each minor iteration, CHUZC_MI scans the (weighted) reduced costs of the candidates in Q and selects one, q say, to enter the basis. Once a basis change has been determined by CHUZR and the RHS has been updated, the standard simplex tableau corresponding to the new basis is obtained by updating the previous tableau in an operation known as UPDATE_MI. The matrix \hat{A}_Q and reduced costs \hat{c}_Q are updated by the standard Gauss–Jordan elimination step. Any Devex weights are updated using row p of the updated tableau. Any steepest edge weights may be computed directly using the known tableau columns. Minor iterations are terminated when either Q is empty or there are no negative reduced costs for q \in Q.
An original advantage of multiple pricing was that the reduced costs were only computed (from scratch) every major iteration using (2.2). As a result, the frequency with which BTRAN and, in particular, PRICE were performed was reduced significantly.
Leaving aside any advantages due to efficient use of memory, the value of the revised simplex method with multiple pricing depends on the extent to which the variables that were good candidates to enter the basis when Q was formed remain
good candidates, or even remain attractive, during the course of minor iterations. This property is commonly referred to as candidate persistence and its effect depends upon both the problem being solved and the number of candidates chosen when forming Q.
Since the entering variable in the second and subsequent minor iterations is not necessarily the best candidate according to the original selection criterion, it is possible that the number of basis changes required to solve the LP problem will increase, perhaps significantly. If any of the original candidates becomes unattractive and minor iterations terminate with Q not empty, the FTRAN and update operations required to form and maintain the tableau column for such a candidate are wasted.
2.4 The representation of B^{-1}
Forming and updating the explicit inverse matrix of B as a dense data structure is simple and, if reformed when appropriate, is numerically stable. The FTRAN and BTRAN operations thus reduce to dense matrix–vector products. For general large sparse LP problems the computational cost of an iteration of the revised simplex method with a dense inverse is O(m^2). This is comparable with the O(mn) cost of the standard simplex method, and so is of no greater practical interest. Efficient implementations of the revised simplex method use a data structure corresponding to a factored representation of B^{-1}. This representation is based on that obtained by the INVERT operation for some basis matrix B_0. There are various techniques for updating the factored representation of B^{-1}, the original and simplest being the product form update of Dantzig and Orchard-Hays (1954).
2.4.1 Procedures for INVERT
Although there are many techniques for factorising a general unsymmetric sparse matrix, in the context of the revised simplex method, there is much commonality of approach. The general aim is to use Gaussian elimination to obtain a factored representation of B^{-1}, seeking a reconciliation of the partially conflicting goals of low computational cost, low fill-in and numerical stability. Using an active submatrix that is, initially, B, Gaussian elimination determines a sequence of m entries to be used as pivots. Each pivot is used to eliminate any other entries in the corresponding column. The active submatrix is then updated by removing the row and column corresponding to the pivot. No INVERT technique is optimal for all LP problems.
LP basis matrices frequently contain significant numbers of columns of the identity matrix corresponding to logical variables. These are examples of singleton columns whose entries can be used as pivots without incurring any fill-in. It is also likely that there will be singleton rows that can be used as pivots similarly. As singletons are identified, pivoted on and their corresponding rows and columns removed from the active submatrix, further singletons may be created. The use of this triangularisation phase is due to Orchard-Hays (1968) and ends when no further singletons can be identified. Any remaining rows and columns contain at least two nonzeros and form what is frequently referred to as the bump.
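A toy sketch of this triangularisation phase follows. For brevity it pivots only on column singletons (row singletons would be treated symmetrically), and the column-to-row-set representation of the basis nonzero pattern is a purely structural assumption made for illustration.

```python
def triangularise(cols):
    """Illustrative triangularisation phase of INVERT.

    cols maps each column index to the set of row indices of its nonzeros.
    Repeatedly pivot on column singletons, removing the pivotal row and
    column from the active submatrix; each removal may create further
    singletons.  Whatever survives is the bump.
    """
    active = {j: set(rows) for j, rows in cols.items()}
    pivots = []
    changed = True
    while changed:
        changed = False
        for j in list(active):
            if len(active[j]) == 1:           # column singleton: pivot on it
                (i,) = active[j]
                pivots.append((i, j))
                del active[j]
                for rows in active.values():  # remove the pivotal row
                    rows.discard(i)
                changed = True
    bump = {j: rows for j, rows in active.items() if rows}
    return pivots, bump
```

For a (permuted) triangular matrix the loop consumes every row and column and the bump is empty, matching the network-problem case described below.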
For certain LP problems, in particular network problems, the triangularisation phase only terminates when all rows and columns have been pivoted on. This corresponds to permuting the rows and columns of B so that it is triangular, allowing B^{-1} to be represented without incurring any fill-in. For many LP problems, the number of rows and columns remaining after triangularisation is very small. In this case, the technique used to factor the bump is largely immaterial since the computationally intensive part of INVERT is triangularisation and any fill-in is small relative to the number of nonzeros in the representation of B^{-1}. For other LP problems, the size of the bump is comparable to that of the matrix, in which case it is important how the bump is factored. A simple technique due to Tomlin (1972) is effective for all but the most numerically awkward and structurally unfortunate LP problems, for which a Markowitz (1957) strategy such as that used by Suhl and Suhl (1990) may be preferable.
In a typical implementation of the revised simplex method for large sparse LP problems, B_0^{-1} is represented as a product of K_I elimination operations derived directly from the nontrivial columns of the matrices L and U in the LU decomposition of (a row and column permutation of) B_0 resulting from the elimination. This invertible representation allows B_0^{-1} to be expressed algebraically as B_0^{-1} = \prod_{k=K_I}^{1} E_k^{-1}, where

    E_k^{-1} =
    \begin{bmatrix}
    1 &        & \eta_{1k}         &        &   \\
      & \ddots & \vdots            &        &   \\
      &        & \eta_{p_k-1,k}    &        &   \\
      &        & \mu_k             &        &   \\
      &        & \eta_{p_k+1,k}    &        &   \\
      &        & \vdots            & \ddots &   \\
      &        & \eta_{mk}         &        & 1
    \end{bmatrix},                                        (2.3)

with the eta entries appearing in column p_k. Within an implementation, the nonzeros in the eta vector

    \eta_k = [\eta_{1k} \ldots \eta_{p_k-1,k} \; 0 \; \eta_{p_k+1,k} \ldots \eta_{mk}]^T

are stored as value-index pairs and the data structure \{p_k, \mu_k, \eta_k\}_{k=1}^{K_I} is referred to as an eta file. Eta vectors associated with the matrices L and U coming from Gaussian elimination are referred to as L-etas and U-etas respectively. Using the generic RHS vector r for simplicity, the operations with p_k, \mu_k and \eta_k required when forming E_k^{-1} r during the standard algorithms for FTRAN and E_k^{-T} r during BTRAN are illustrated in Fig. 3.
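Fig. 3 itself is not reproduced in this extract, so the following sketch infers the elementwise operations from the form (2.3): applying E_k^{-1} to r scales the pivotal component by \mu_k and adds multiples of it into the other eta positions, while the transposed operation gathers a dot product into the pivotal component. The data layout, a list of (p_k, \mu_k, \eta_k) triples with \eta_k held as index-value pairs, is an assumption consistent with the eta file described above.

```python
def ftran(eta_file, r):
    """Form B0^{-1} r by applying each eta in turn (I-FTRAN).

    eta_file is a list of (p_k, mu_k, eta_k) with eta_k a list of
    (row index, value) pairs; the zero in position p_k is not stored.
    """
    r = list(r)
    for p, mu, eta in eta_file:
        t = r[p]
        if t == 0.0:
            continue                   # sparse r: this eta changes nothing
        r[p] = mu * t
        for i, v in eta:
            r[i] += v * t
    return r

def btran(eta_file, r):
    """Form r^T B0^{-1} by applying the transposed etas in reverse order."""
    r = list(r)
    for p, mu, eta in reversed(eta_file):
        r[p] = mu * r[p] + sum(v * r[i] for i, v in eta)
    return r
```

The early exit when the pivotal component of r is zero is the structural observation that underpins the hyper-sparse FTRAN of Sect. 2.5.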
2.4.2 Updating the factored representation of B^{-1}
If E_U is used to denote the transformation of B_0 corresponding to the basis changes following INVERT, such that B = B_0 E_U, it follows that B^{-1} may be expressed as B^{-1} = E_U^{-1} B_0^{-1}. The product form update leaves the factored form of B_0^{-1} unaltered
and represents E_U^{-1} as a product of pairs of elementary matrices of the form (2.3). In the UPDATE operation, the modification of B^{-1} corresponding to the basis change is obtained directly from the pivotal column and is given by p_k = p, \mu_k = 1/\hat{a}_{pq} and \eta_k = \hat{a}_q - \hat{a}_{pq} e_p.

Fig. 3 Standard operations in FTRAN and BTRAN
In solvers based on the product form, the representation of the UPDATE operations can be appended to the eta file following INVERT, resulting in a single homogeneous data structure. However, in this paper, the particular properties of the INVERT and UPDATE etas and the nature of the operations with them may be exploited, so FTRAN is viewed as the pair of operations

    \tilde{a}_q = B_0^{-1} a_q                            (I-FTRAN)

followed by

    \hat{a}_q = E_U^{-1} \tilde{a}_q.                     (U-FTRAN)

Conversely, BTRAN is viewed as

    \tilde{\pi}^T = r^T E_U^{-1}                          (U-BTRAN)

followed by

    \pi^T = \tilde{\pi}^T B_0^{-1}.                       (I-BTRAN)
Other procedures for UPDATE modify the factored representation of B_0^{-1} with the aim of reducing the size of the eta file and achieving numerical stability. Whilst they may be more efficient in serial for some problems, there is significant value in leaving the factored representation of B_0^{-1} unaltered. Techniques for improving the efficiency of FTRAN and BTRAN, both in serial and parallel, may require an analysis of the representation of B_0^{-1} following INVERT. This investment is likely only to be worthwhile if many FTRAN and BTRAN operations are performed with the original representation of B_0^{-1}.
2.5 Hyper-sparsity
For most sparse LP problems the pivotal column \hat{a}_q has sufficient zero entries that it is efficient to exploit this when creating the corresponding product form eta in UPDATE.
Similarly, it is often found that the vector \pi_p^T = e_p^T B^{-1} has a significant proportion of zero entries. Thus, if PRICE is performed as a set of inner products \pi_p^T a_i, i \in N, there will be many operations with zeros. In this case it is more efficient to evaluate \hat{a}_p^T = \pi_p^T N as a linear combination of the rows of N corresponding to the nonzeros in \pi_p, even allowing for the overhead of updating a row-wise representation of N following each basis change.
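The row-wise alternative for PRICE can be sketched as follows: instead of one inner product per column of N, the pivotal row is accumulated as a linear combination of the rows of N indexed by the nonzeros of \pi_p. The function name and data layout are illustrative.

```python
def price_rowwise(pi_p, N_rows, n):
    """Form the pivotal row a_p^T = pi_p^T N as a linear combination of
    the rows of N selected by the nonzeros of pi_p.

    pi_p   : nonzeros of pi_p as (row index, value) pairs
    N_rows : row-wise representation of N; N_rows[i] is a list of
             (column index, value) pairs
    n      : number of columns of N
    The cost is proportional to the nonzeros touched, not to m * n.
    """
    result = [0.0] * n
    for i, v in pi_p:                   # one sparse row of N per nonzero of pi_p
        for j, a_ij in N_rows[i]:
            result[j] += v * a_ij
    return result
```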
For some important classes of LP problems, the number of nonzeros in the vectors \hat{a}_q, \pi_p, and the pivotal tableau row \hat{a}_p^T is sufficiently small to be worth exploiting further. This property, which Hall and McKinnon (1999) refer to as hyper-sparsity, has also been identified by Bixby (2002) and Bixby et al. (2000). For such problems, Hall and McKinnon (2005) have shown that significant performance improvement can be gained by exploiting hyper-sparsity. This is achieved by maintaining the indices of the nonzeros in these vectors and exploiting this information when they are both formed and used. Practical experience shows that the threshold at which it is worth exploiting hyper-sparsity corresponds to an average density of pivotal columns (and hence the standard simplex tableau itself) of about 10%. For the basis matrices of LP problems that exhibit hyper-sparsity the dimension of the bump within INVERT is very small relative to that of the matrix.
For some LP problems B^{-1} is typically sparse. This property has been exploited in some implementations of the revised simplex method based on an explicit inverse. However, it is generally the case that the number of nonzeros in B^{-1} (even when freshly formed) vastly exceeds the number of nonzeros in an efficient factored representation.
This property is illustrated, for example, by Shu (1995).
2.6 PRICE when the column/row ratio is large
For LP problems with a large column/row ratio (n \gg m), the cost of the revised simplex variants set out in Figs. 1 and 2 is dominated by that of PRICE. However, for such problems, it is typical for a good candidate to enter the basis to be identified from only a small subset of the columns. The numerous variants of partial and multiple pricing (Orchard-Hays 1968) stem from this observation. In general, their use leads to a reduction in the solution time that is significant, and in some cases vast, despite the fact that they are largely incompatible with the use of edge weight pricing strategies.
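As a sketch of the idea behind partial pricing, the following hypothetical routine scans only a cyclic window of the reduced costs and accepts an attractive candidate from it if one exists; a practical code would fall back to a wider scan when the window yields nothing. The windowing scheme shown is one of many possibilities, not a specific strategy from the literature.

```python
def partial_price(c_hat, start, span):
    """Scan a cyclic window of the reduced costs (partial pricing).

    c_hat : reduced costs of all nonbasic variables
    start : first index of the window, cycled between calls
    span  : number of reduced costs to examine
    Returns an attractive candidate from the window, or None.
    """
    n = len(c_hat)
    window = [(start + k) % n for k in range(span)]
    best = min(window, key=lambda j: c_hat[j])   # most negative in the window
    return best if c_hat[best] < 0 else None
```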
3 Practical LP problems
The following overview provides the necessary background on LP problems for reviewing parallel simplex techniques in Sect. 5 in terms of their value as practical LP solvers. In view of some of the LP problems that have been used to test parallel simplex implementations, a few observations on this issue are also in order.
A direct consequence of many modelling techniques that generate large LP problems is structure and sparsity in the constraint matrix. Variable and constraint sets
corresponding to related LP subproblems are linked to build large single problems. Two important classes of practical LP problems have block-angular structure. Applications involving decentralised planning and specific problems such as multicommodity network flow yield block-angular LP problems with linking rows. Stochastic programming problems in applications such as asset liability management yield block-angular LP problems with linking columns. Depending on the application, the nature of the blocks may range from being fully dense to highly sparse and structured, as in the case of multicommodity flow problems.
In determining the computational challenge of solving a given LP problem, structure and sparsity are often more important than problem dimension. In particular, they have a critical influence on which of the simplex and interior point methods is the more efficient solution procedure. For problems that exhibit hyper-sparsity when solved using the revised simplex method, it is typical for an interior point solver to be several times slower and in some cases the margin is more than an order of magnitude. For LP problems that are not (fully) hyper-sparse, an interior point solver is usually faster by similar factors. It is not properly understood what model properties determine whether the LP will exhibit hyper-sparsity, but it is often associated with a significant amount of network structure. In terms of developing a parallel simplex solver of practical value, two challenges emerge. A successful parallelisation of the revised simplex method when exploiting hyper-sparsity would lead immediately to a solver of great value, whereas a parallel simplex solver for problems that are not hyper-sparse may catch up valuable lost ground in the competition with interior point methods. These challenges are discussed in greater detail in Sect. 6.
There are a few applications such as the design of filters and wavelet analysis which yield large dense LP problems. Their existence is used to justify the practical value of developing parallel implementations of the standard simplex method and the revised simplex method with dense matrix algebra. However, the existence of these applications should not be over-played, nor obscure the fact that the development of methods for the efficient solution of sparse LP problems remains the overwhelmingly important goal.
As well as sparsity and structure, the degeneracy and numerical properties of an LP problem are important factors affecting the computational challenge posed by its solution. Both of these attributes are usually a consequence of the modelling process. For this reason, randomly generated problems offer a poor means of testing the performance of an LP solver, with random sparsity patterns being particularly misleading. Despite the availability of sparse LP test problems, it is disappointing how many authors give results for randomly generated problems.
For many years the Netlib set (Gay 1985) was the standard source of general sparse LP test problems. Although a few are difficult to solve for numerical or structural reasons and some are highly degenerate, none can now be viewed as being large. The original Kennington set (Carolan et al. 1990) contains significantly larger problems. The test problems used by Mittelmann for benchmarking commercial LP solvers (Mittelmann 2008) are large by modern standards. With values of mn between 10^7 and 10^{12}, they reflect the full range of attributes that contribute to the computational challenge of solving large sparse practical LP problems.
4 Parallel computing
In classifying the approaches used by attempts to parallelise the simplex method, reference is made to terms and concepts in parallel computing. This section introduces the necessary terms. A full and more general introduction to parallel computing is given by Kumar et al. (2003).
4.1 Parallel multiprocessors
When classifying parallel multiprocessor architectures, an important distinction is between distributed memory, when each processor has its own local memory, and shared memory, when all processors have access to a common memory. Modern massively parallel machines may consist of a set of distributed clusters, each of which has a number of processors with shared memory. On smaller multiprocessors memory may be either purely distributed or shared. With chip speed limiting increases in computing power, hardware vendors are turning their attention to the production of low-cost multi-core architectures.
4.2 Speed-up, scalability and Amdahl's law
In general, success in parallelisation is measured in terms of speed-up, the time required to solve a problem with more than one parallel processor compared with the time required using a single processor. The traditional goal is to achieve a speed-up factor equal to the number of processors. Such a factor is referred to as linear speed-up and corresponds to a parallel efficiency of 100%. The increase in available cache, and frequently RAM, with the number of parallel processors occasionally leads to the phenomenon of super-linear speed-up being observed. Parallel schemes for which, at least in theory, performance improves linearly and without limit as the number of processors increases are said to be scalable. If parallelism is not exploited in all major algorithmic operations then the speed-up is limited, according to Amdahl's (1967) law, by the proportion of the serial execution time for the operations that have not been parallelised.
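Amdahl's bound is easily computed: if a fraction f of the serial execution time belongs to operations that have not been parallelised, the speed-up on P processors is at most 1/(f + (1 - f)/P). A one-line illustration:

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speed-up, by Amdahl's law, when a fraction of the
    serial execution time has not been parallelised."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)
```

With even 10% of the work left serial, ten processors yield a speed-up of barely five, and no number of processors can exceed ten.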
4.3 Programming paradigms and their implementation
There are two principal parallel programming paradigms. If the work of a major algorithmic operation may be distributed across a set of processors then data parallelism may be exploited. Conversely, if it is possible to perform several major algorithmic operations simultaneously, then task parallelism may be exploited. In practice, it may be possible to exploit both data parallelism and task parallelism for a particular (set of) major algorithmic operation(s).
There are two fundamental means of implementing algorithms on parallel machines. On distributed memory machines the concept of communicating data between processors using instructions initiated by explicit calls to message-passing subroutines is
natural and immediate. On shared memory machines the conceptually natural technique is data-parallel programming in which operations are coded as if they were to be executed serially but translated into a parallel executable by use of an appropriate compiler. Many message-passing environments are also supported on shared memory machines and data-parallel programming techniques are also supported on distributed memory machines.
On distributed memory machines, the overhead of sending messages to communicate data between processors is determined by both latency and bandwidth. The former is a time overhead that is independent of the size of the message and the latter is the rate of communication. For generic message-passing environments, the latency may be significantly higher, and the bandwidth lower, than for an architecture-dependent environment that is usually supplied and tuned by the vendor. When an algorithm has a high ratio of communication to computation, growing communication overhead may outweigh any improvement in potential computational performance resulting from the use of an increased number of processors.
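The standard linear cost model for a point-to-point message makes the latency/bandwidth distinction concrete. The figures below are illustrative assumptions, not measurements from any particular machine:

```python
def message_time(bytes_sent, latency_s, bandwidth_bytes_per_s):
    """Linear cost model for one point-to-point message: a fixed
    start-up latency plus message size divided by bandwidth."""
    return latency_s + bytes_sent / bandwidth_bytes_per_s

# Assumed figures for illustration: 50 microsecond latency, 100 MB/s bandwidth.
small = message_time(1_000, 50e-6, 100e6)       # about 6e-05 s: latency-dominated
large = message_time(10_000_000, 50e-6, 100e6)  # about 0.1 s: bandwidth-dominated
```

Under this model, frequent small messages (as in a fine-grained parallel simplex iteration) pay mostly latency, which explains why a high communication-to-computation ratio is so damaging.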
5 Parallelisation of the simplex method
Past approaches to exploiting parallelism in the simplex method and simplex-like methods are conveniently classified according to the variant of the simplex method that is considered and the extent to which sparsity is exploited. It will be seen that parallel implementations of simplex variants that are more efficient as a means of solving large sparse LP problems are less successful in terms of speed-up. Conversely, the simplex variants that are generally least efficient demonstrate the best speed-up.
A few of the parallel schemes discussed below offered good speed-up relative to efficient serial solvers of their time. However, some techniques that have been parallelised are only now seen as being inefficient in the light of serial revised simplex techniques that were either little known at the time or developed subsequently. This is identified below to emphasise that, as a result of the huge increase in efficiency of serial revised simplex solvers both during and since the period of research into parallel simplex, the task of developing a practical parallel simplex-based solver has become much more difficult.
5.1 Parallel simplex using dense matrix algebra
The dense standard simplex method and the revised simplex method with a dense explicit inverse have been implemented in parallel many times. The simple data structures involved and the potential for linear speed-up make them attractive exercises in parallel computing. However, for solving general large sparse LP problems, the serial inefficiency of these implementations is such that only with a massive number of parallel processes could they conceivably compete with a good sparsity-exploiting serial implementation of the revised simplex method.
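The data parallelism that makes the dense standard simplex method such an attractive exercise is visible in a single tableau update: every non-pivot row is transformed independently. A minimal NumPy sketch of one Gauss-Jordan pivot (illustrative only; a parallel code would partition the rows across processors):

```python
import numpy as np

def parallel_tableau_update(T, r, q):
    """One pivot of the dense standard simplex method on tableau T with
    pivot row r and pivot column q.  Each non-pivot row update is
    independent of the others, so the rows can be partitioned across
    processors with near-linear speed-up."""
    T[r, :] /= T[r, q]                            # scale the pivot row
    mask = np.arange(T.shape[0]) != r
    T[mask, :] -= np.outer(T[mask, q], T[r, :])   # independent row updates
    return T
```

After the update, column q of the tableau is the unit vector with a one in row r, as required.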
Early work on parallelising the simplex method using dense matrix algebra was mainly restricted to discussion of data distribution and communication schemes, with implementations limited to small numbers of processes on distributed memory
machines. Surveys are given by Zenios (1989) and Luo et al. (1991), and examples of other early references to work in this area are due to Finkel (1987), Boffey and Hay (1989), Babaev and Mardanov (1991) and Agrawal et al. (1989). A relatively early work which is representative of this era is due to Stunkel (1988), who implemented both the dense standard simplex method and the revised simplex method with a dense inverse on a 16-processor Intel hypercube, achieving a speed-up of between 8 and 12 for small problems from the Netlib set (Gay 1985). Cvetanovic et al. (1991) report a speed-up of 12 when solving two small problems using the standard simplex method, a result that is notable for being achieved on a 16-processor shared memory machine. Luo and Reijns (1992) obtained speed-ups of more than 12 on 16 transputers when using the revised simplex method with a dense inverse to solve modest Netlib problems. However, they developed their parallel scheme on the assumption that the computational cost of updating the inverse using the expression B^{-1} := E_k^{-1} B^{-1} was O(m^3), rather than O(m^2) if the structure of E_k^{-1} (2.3) is properly exploited, an error which might well explain their high efficiency.
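The O(m^2) cost follows because E_k^{-1} differs from the identity only in one column, so applying it to a vector costs O(m), and updating all m columns of B^{-1} costs O(m^2). A sketch of this product form update (illustrative names; eta is the pivotal column of E_k and p its pivot row):

```python
import numpy as np

def apply_eta_inverse(x, p, eta):
    """Apply E_k^{-1} to a vector in O(m), where E_k is the identity with
    column p replaced by eta.  Only the pivot entry of x drives the
    update, so no general matrix product is needed."""
    xp = x[p] / eta[p]
    x -= eta * xp        # subtract the multiple of the eta column ...
    x[p] = xp            # ... then overwrite the pivot entry
    return x

def update_inverse(Binv, p, eta):
    """B^{-1} := E_k^{-1} B^{-1}, column by column: O(m^2) overall."""
    for j in range(Binv.shape[1]):
        apply_eta_inverse(Binv[:, j], p, eta)
    return Binv
```

Treating E_k^{-1} B^{-1} as a general matrix-matrix product would instead cost O(m^3), the mistaken assumption noted above.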
Eckstein et al. (1995) parallelised the standard simplex method and the revised simplex method with a dense inverse on the massively parallel Connection Machine CM-2 and CM-5, incorporating the steepest edge pricing strategy directly within their standard simplex implementation, rather than via an update formula for the edge weights as in Goldfarb and Reid (1977). As a consequence of using steepest edge weights and the EXPAND procedure (Gill et al. 1989), this implementation is notable for its numerical stability, an issue that has rarely been considered in parallel implementations of the simplex method. Further details are given by Boduroğlu (1997). When solving a range of larger Netlib problems and very dense machine learning problems, speed-ups of between 1.6 and 1.8 were achieved when doubling the number of processors. They also presented results which indicated that the performance of their implementation was generally superior to the Minos 5.4 serial sparse revised simplex solver, particularly for the denser problems. Thomadakis and Liu (1996) also used steepest edge in their implementation of the standard simplex method on MasPar MP-1 and MP-2 machines. Solving a range of large, apparently randomly-generated problems, they achieved a speed-up of up to three orders of magnitude on the 128×128 processor MP-2. More
recently, theoretical work on parallel implementations of the standard simplex method with steepest edge, and its practical implementation on modest numbers of processors, has also been reported by Yarmish (2001).
Reports of small-scale parallel implementations of the dense standard simplex method continue to appear. Relatively recently, Badr et al. (2006) presented results for an implementation on eight processors, achieving a speed-up of five when solving small random dense LP problems.
5.2 Parallel simplex using sparse matrix algebra
The real challenge in developing a parallel simplex implementation of general practical value is to exploit parallelism when using efficient sparse matrix algebra techniques. Only then could the resulting solver be competitive with a good serial implementation when solving general large sparse LP problems using a realistic number of processors.
At the time when the parallelisation of the simplex method was first and most widely considered, practical parallel factorisation and solution methods for sparse unsymmetric linear systems were in their infancy. As a consequence, although work on parallelising the simplex method using dense matrix algebra has generally exploited data parallelism, it was a commonly held view that when using sparse matrix algebra there was relatively little scope for exploiting data parallelism (with the exception of PRICE). This has led some to conclude that there is no real scope for developing a
worthwhile parallel implementation. However, not only is this (now) open to question, there is also scope for exploiting task parallelism despite the apparently sequential nature of the computational components of the revised simplex method.
5.2.1 Sparse standard simplex method
It was observed above that the standard simplex method is generally implemented using dense matrix algebra. However, based on the observation that fill-in in the tableau can be low, Lentini et al. (1995) developed a parallel implementation of the standard simplex method with the tableau stored as a sparse matrix. In this remarkably novel approach, they incorporated a Markowitz-like criterion into column and row selection with the aim of reducing fill-in. When solving medium sized Netlib problems on four transputers they achieved a speed-up of between 0.5 and 2.7, with a super-linear speed-up of 5.2 on one problem with a relatively large column-row ratio. They compared their results on eight transputers with a sparse revised simplex solver running on a mainframe (without giving any justification of the validity of this comparison) and found that their parallel solver was a little faster on smaller problems but slower on larger problems, particularly when solving four proprietary problems. Although they describe their results as disappointing, this brave and imaginative approach to parallelising the simplex method deserves to be better known.
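The Markowitz-like idea can be sketched in a few lines: among pivot candidates that are already acceptable on simplex grounds, prefer the one whose row and column nonzero counts predict the least fill-in. This toy fragment (hypothetical helper names, dense storage for brevity) is a sketch of the criterion, not of Lentini et al.'s implementation:

```python
import numpy as np

def markowitz_count(T, i, j):
    """Fill-in estimate for pivoting on T[i, j]: (nonzeros in row i - 1)
    times (nonzeros in column j - 1), as in sparse LU pivot selection."""
    r = np.count_nonzero(T[i, :]) - 1
    c = np.count_nonzero(T[:, j]) - 1
    return r * c

def select_pivot(T, candidates):
    """Among (row, col) pivot candidates already acceptable to the simplex
    ratio test, prefer the one with the smallest Markowitz count."""
    return min(candidates, key=lambda ij: markowitz_count(T, *ij))
```

The tension, of course, is that the pivot minimising fill-in need not be the one making the best progress towards optimality.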
5.2.2 Revised simplex method with a sparse inverse
When the revised simplex method with an explicit inverse is implemented, dense data structures are generally used to maintain B^{-1}. However, Shu (1995) identified that B^{-1} is by no means full and considered parallelising the revised simplex method with B^{-1} maintained using a sparse data structure. At first sight, her results using up to 64 processors on a Touchstone Delta, and up to 16 processors on an Intel iPSC/2 hypercube, appear good. Using the sparse representation of B^{-1}, as well as the traditional dense representation, Shu obtained a speed-up of up to 17 on the Touchstone Delta when solving small to modest Netlib problems. Unfortunately, for all but one of these problems the average density of B^{-1} is between 33 and 47%, so there is little difference in the solution times when using a sparse or dense B^{-1}. Of those problems for which comparative dense-sparse results are given, only in the case of SHIP08L is the average sparsity (6%) of B^{-1} worth exploiting: the serial sparse implementation was twice as fast as the dense implementation and a speed-up of 16 was achieved, compared with a speed-up of 17 for the dense case. Results for significantly larger problems, including some of the CRE and OSA problems from the Kennington test set (Carolan et al. 1990), are also given, although it is not clear how B^{-1} was represented. Speed-ups of
up to 8 were achieved on both machines. However, for these larger problems, analysis by Shu (1995) of the computational cost of components of her implementation of the revised simplex method with a factored inverse indicated that PRICE, which parallelises easily, constitutes between 65 and 96% of the solution time. Although these proportions would be lower in the explicit inverse implementation, they are much higher than would be the case had Shu updated the reduced costs and used a row-wise PRICE, exploiting the moderate to high levels of hyper-sparsity that these problems are now known to possess (Hall and McKinnon 2005). If the results for the problems where PRICE dominates are excluded, the speed-ups obtained are in low single figures, even when using up to 64 processors.
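A row-wise PRICE forms π^T N by accumulating only the rows of N indexed by the nonzeros of π, which is where hyper-sparsity pays off. A minimal NumPy sketch (dense storage for brevity; a real solver would hold the rows of N in a packed sparse form):

```python
import numpy as np

def rowwise_price(N, pi):
    """Form pi^T N by accumulating only the rows of N for which pi has a
    nonzero entry.  When pi is hyper-sparse this touches a small fraction
    of N, unlike a column-wise PRICE, which scans every column."""
    result = np.zeros(N.shape[1])
    for i in np.nonzero(pi)[0]:
        result += pi[i] * N[i, :]
    return result
```

With a π that is, say, 1% dense, the loop visits 1% of the rows, so the operation is far cheaper than its column-wise counterpart even before any parallelism is considered.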
5.2.3 Revised simplex method with a factored inverse
Efficient serial simplex solvers are based on the revised simplex method with a factored inverse since the practical LP problems whose solution poses a computational challenge are large and sparse. For such problems, the superiority of the revised simplex method with a factored inverse over the serial simplex techniques underlying the parallel schemes reviewed above is overwhelming. It follows that the only scope for developing a really worthwhile parallel simplex solver is to identify how the revised simplex method with a factored inverse may be parallelised.
The natural data parallelism of PRICE has been exploited by most authors. Some have then considered the data parallelism in other computational components, whereas others have studied the extent to which task parallelism can be exploited by overlapping operations that are then executed either in serial or, in the case of PRICE, by exploiting data parallelism.
5.2.3.1 Theoretical pure data parallel approaches In an early work, Pfefferkorn and Tomlin (1976) discussed how parallelism could be exploited in each computational component of the revised simplex method. This purely data-parallel scheme, devised for the ILLIAC IV, was never implemented. After observing the parallelism of a column PRICE operation, they then considered FTRAN and BTRAN. They identified that if a sequence E_s^{-1} ... E_r^{-1} within the factored representation of B^{-1} has the property that the pivots are in distinct rows and the eta vectors have no entries in these rows, then the application of these etas is independent in both FTRAN and BTRAN. This may be illustrated, without loss of generality, by assuming that r = 1, s = K and that the pivots are in rows 1, ..., K, in which case it is readily established that

E_K^{-1} ... E_1^{-1} = [ I  0 ] [ M  0 ]
                        [ E  I ] [ 0  I ]

where E_{ij} = η_{K+i}^j and M = diag{μ_1, ..., μ_K}. The corresponding operations E_K^{-1} ... E_1^{-1} r in FTRAN may now be performed as

r_K := M r_K;   r̄_K := r̄_K + E r_K,   where r = [ r_K ]
                                                 [ r̄_K ]
and, in BTRAN, r_K^T := (r_K^T + r̄_K^T E) M. Note that for this technique to be effective it is necessary for most of the (initial) components of r_K to be nonzero or, equivalently, that most of the etas will have to be applied in the conventional view of FTRAN and BTRAN. For LP problems exhibiting hyper-sparsity this does not occur, so this technique will not be of value. For other problems, the assumption that such sequences exist to a significant extent may be unreasonable. Any techniques developed by Pfefferkorn and Tomlin for exploiting parallelism during INVERT, which is based on that of Tomlin (1972), are either not stated or unclear, possibly referring to the potential for parallelising the search for singletons in the triangularisation phase.
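The two-step application identified by Pfefferkorn and Tomlin can be sketched directly from this partitioned form. In the NumPy fragment below (illustrative names: mu holds the diagonal of M, and E the eta entries outside the pivot rows), each step is either an independent scaling or an independent matrix-vector product, which is precisely where the data parallelism lies:

```python
import numpy as np

def ftran_block(r, mu, E):
    """Apply E_K^{-1} ... E_1^{-1} = [I 0; E I][M 0; 0 I] to r in two
    data-parallel steps, assuming pivots in rows 0..K-1 and eta entries
    confined to the remaining rows."""
    K = mu.size
    r[:K] *= mu            # r_K := M r_K          (independent scalings)
    r[K:] += E @ r[:K]     # remaining rows updated independently
    return r

def btran_block(r, mu, E):
    """The corresponding BTRAN step: r_K^T := (r_K^T + r_rest^T E) M."""
    K = mu.size
    r[:K] = (r[:K] + E.T @ r[K:]) * mu
    return r
```

Both routines touch every entry of r once, which is why the technique only pays off when r fills in quickly, as noted above.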
Helgason et al. (1988) discussed the scope for parallelising FTRAN and BTRAN based on the quadrant interlocking factorisation of Evans and Hatzopoulos (1979). This relatively unknown factorisation scheme corresponds to forming an LU decomposition of (a row and column permutation of) B in which, along their diagonals, L has 2×2 blocks and U has 2×2 identity matrices. Thus the entire eta file following INVERT consists of sequences of the type identified by Pfefferkorn and Tomlin (1976), each of length two. It follows that if r fills in sufficiently fast during FTRAN and BTRAN, there is the potential for speed-up bounded above by two. Developed with a shared memory machine in mind, this scheme was never implemented. Whilst the potential speed-up of two is small, an extension of this scheme using larger blocks could have greater potential.
McKinnon and Plab (1997b) considered how data parallelism could be exploited in FTRAN for both dense and sparse right-hand sides. They also investigated how the Markowitz criterion could be modified in order to increase the potential for exploiting data parallelism in subsequent FTRAN operations (McKinnon and Plab 1997a).
Since the time when the parallelisation of the simplex method was first and most actively considered, there have been huge efforts and advances in practical parallel factorisation and solution methods for sparse unsymmetric linear systems. However, the techniques thus developed have not yet been applied to the revised simplex method, the scope for which is discussed in Sect. 6.
5.2.3.2 Parallel PRICE for the dual simplex method Virtually all attempts to exploit parallelism in the simplex method have focused on the primal simplex method. The dual simplex method, which is very much more efficient for many single LP problems and is valuable when solving LP subproblems in branch-and-bound for integer programming, has computational requirements in each iteration that are very similar to those of the primal algorithm. There is, however, one important distinction between the primal and dual simplex methods. In the latter there is no variant corresponding to partial pricing in the primal simplex method whereby PRICE is restricted to just a subset of the nonbasic variables. For problems with very large column/row ratios, the overwhelmingly dominant cost of serial dual simplex solvers is PRICE. Hence its parallelisation can be expected to yield a significant improvement in performance over that of efficient serial dual simplex solvers.
Bixby and Martin (2000) investigated the scope for parallelism in the dual simplex method and chose to parallelise only those operations whose cost is related to the number of columns in the problem, that is PRICE, the dual ratio test and update of the dual variables. Much the most expensive of these is PRICE in which, as in the primal
algorithm, the pivotal row is formed via the matrix-vector product π_p^T N. Each process formed a part of π_p^T N, with the corresponding columns of N held row-wise so that sparsity in π_p can be exploited. The work of Bixby and Martin is also notable since the underlying dual simplex code being parallelised was CPLEX. Thus any speed-up was necessarily relative to a state-of-the-art serial solver. They implemented the dual simplex method on several architectures but, for the full Netlib set, the overhead of exploiting parallelism exceeded any gain in performance. This is unsurprising since few problems in the Netlib set have significantly more columns than rows. Focusing on test problems with very large column/row ratios, using two Sun workstations connected by Ethernet they achieved very little speed-up. Performance using up to four processors on an IBM SP2 was better, with speed-up ranging from 1 to 3. Switching to experiments using two processors on shared memory machines, little speed-up was obtained with a Sun S20-502 but better results were obtained with a Silicon Graphics multiprocessor. In later work reported in this paper (Bixby and Martin 2000), the speed-up achieved using four processors of a shared memory SGI Power Challenge was better still, being near linear for most problems with more than 64 columns per row. Strangely, this behaviour tailed off for the problems with the most extreme numbers of columns per row.
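The column-partitioned PRICE at the heart of this approach is easy to sketch. The fragment below simulates the partitioning within one process (in a real distributed code each slice of N would be resident on its own processor and only π_p would be communicated):

```python
import numpy as np

def parallel_price(N, pi, nproc):
    """Sketch of a distributed PRICE: the columns of N are partitioned
    across nproc workers, each forms its own slice of pi^T N, and the
    slices are concatenated.  Each slice product is independent, so it
    could run on a separate processor."""
    slices = np.array_split(np.arange(N.shape[1]), nproc)
    parts = [pi @ N[:, cols] for cols in slices]
    return np.concatenate(parts)
```

Since the work grows with the number of columns while the communicated vector grows with the number of rows, the scheme favours problems with very large column/row ratios, consistent with the results reported above.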
5.2.3.3 Task and data parallel revised simplex The first attempt to exploit task parallelism in the revised simplex method was reported by Ho and Sundarraj (1994). In addition to the natural data parallelism of (column-wise) PRICE, Ho and Sundarraj identified that INVERT can be overlapped with simplex iterations. Once formed, the representation of B_0^{-1} is generally not up-to-date with respect to basis changes that occurred during INVERT. However, when using the product form update, a representation of B^{-1} is readily obtained by incorporating the eta vectors formed since INVERT was started. By dedicating one processor solely to INVERT, Ho and Sundarraj observe that the frequency of INVERT should be increased. This can be expected to reduce the cost of FTRAN and BTRAN since the representation of B^{-1} will involve fewer eta vectors. Whilst they claim that having fewer eta vectors will have advantages in terms of numerical stability, they do not discuss the potential for numerical instability identified earlier by Hall and McKinnon (1992). As can be expected for an only partially parallelised algorithm, the practical performance of Ho and Sundarraj's implementation, on an Intel iPSC/2 and Sequent Balance 8000, is limited in accordance with Amdahl's law. On a set of small Netlib and proprietary problems, they report an average saving of 33% over the serial solution time.
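The overlap pattern is simple to illustrate with a thread standing in for the dedicated INVERT processor. This is a structural sketch only (placeholder data, no actual factorisation): the iterating process records the etas formed after the basis snapshot so that, when the new factorisation arrives, it is brought up to date by appending exactly those etas, as the product form update allows.

```python
import threading
import queue

def invert_worker(basis_snapshot, result_q):
    """Stands in for a (slow) serial INVERT of the snapshotted basis."""
    result_q.put(("factor-of", tuple(basis_snapshot)))

basis = [0, 1, 2]
etas_since_invert = []        # etas formed while INVERT runs elsewhere
result_q = queue.Queue()
t = threading.Thread(target=invert_worker, args=(basis[:], result_q))
t.start()
for step in range(3):         # simplex iterations overlapped with INVERT
    etas_since_invert.append(("eta", step))
factor = result_q.get()       # blocks until the new factorisation is ready
t.join()
# Up-to-date representation: the new factor plus the pending etas.
representation = (factor, list(etas_since_invert))
```

The numerical-stability caveat noted above arises precisely because those pending etas were computed with the old, possibly poor, representation of B^{-1}.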
To date, there have only been four implementations of the revised simplex method that have attempted to exploit parallelism anything like fully: one by Shu (1995), one by Wunderling (1996a,b) and two by Hall and McKinnon (1996, 1998). The most ambitious of the parallel simplex implementations developed by Shu (1995) remains the only pure data parallel implementation of the revised simplex method. It was based on a parallel triangularisation phase for INVERT, a distributed sparse LU decomposition of the basis matrix for parallel FTRAN and BTRAN, as well as parallel PRICE. She did not parallelise the application of the product form update etas, claiming (wrongly) that its parallelisation is inconceivable. Unfortunately no speed-up was achieved. For one
[Figure: Gantt chart showing, for four processors, overlapped FTRAN, BTRAN with steepest edge (StEd), PRICE (P1, P2) and update (UT) operations, with CHUZC (C), CHUZR (R) and INVERT on processor 0.]
Fig. 4 Wunderling's parallel revised simplex with steepest edge
problem the solution time using two processors was the same as the serial time and, otherwise, the parallel solution time deteriorated rapidly.
Wunderling's variant of the revised simplex method was based on multiple pricing with steepest edge weights. The theoretical task parallelism is conveniently illustrated by the Gantt chart in Fig. 4. Wunderling did not perform explicit multiple pricing since all steepest edge weights were updated after each basis change. However, since the remaining tableau columns for attractive candidates were updated, the variant was equivalent to multiple pricing. Since INVERT was not overlapped with other calculations, the implementation was not quite fully parallel. If this serial operation is excluded then the implementation is fully parallel for two processors so long as there is not a significant load imbalance between the two BTRAN operations. However, a good simplex solver will exploit the fact that the right-hand side of the standard update BTRAN, π_p^T = e_p^T B^{-1}, contains only one nonzero (in the pivotal row), whereas it may not be worth exploiting any sparsity in the right-hand side of the additional steepest edge BTRAN, w^T = â_q^T B^{-1}. Thus the parallelism of the BTRAN operations may be illusory.
It is unfortunate that the only published account of Wunderling's (1996b) very interesting work on parallel (and serial) simplex is his Ph.D. thesis, written in German. At a workshop presentation (Wunderling 1996a), he described his parallel simplex implementation and gave limited results. Wunderling's parallel scheme was implemented on a 2-processor Sun 20 and a Cray T3D using modules of Soplex, his efficient public-domain simplex solver. On general LP problems any performance improvement that was achieved was minimal and significant speed-up was only achieved for problems with large column/row ratios.
The first of Hall and McKinnon's (1998) two parallel revised simplex schemes was ASYNPLEX. This corresponds to a variant of the revised simplex method in which reduced costs are computed directly rather than being updated, and Dantzig pricing is used. Ideally, one processor is devoted to INVERT and the remaining processors perform simplex iterations. The theoretical operations of these processors are illustrated by the Gantt chart in Fig. 5. If a single path through the feasible region is to be followed then, clearly, two processors cannot perform basis changes simultaneously. An additional processor was introduced to act as basis change manager to ensure that a processor is only allowed to start CHUZR if its basis is up-to-date and no other processor has been allowed to start CHUZR at the current basis. Although each processor has reduced costs, more up-to-date reduced costs may be known on other processors. Thus, once a processor has computed the full set of reduced costs, a small set of good candidates is communicated to a further processor. This acts as a column selection
[Figure: Gantt chart in which processor 0 performs successive INVERT operations while processors 1 to 4 each cycle through CHUZC (C), FTRAN, CHUZR (R), BTRAN and PRICE, staggered so that their basis changes do not coincide.]
Fig. 5 ASYNPLEX
manager and responds to requests made prior to FTRAN by the iteration processors, by communicating the best (known) candidate to enter the basis. Note that, because basis changes were determined during the course of calculating reduced costs, column selection was generally performed using out-of-date reduced costs. These reduced costs became further out-of-date as a consequence of basis changes occurring during FTRAN.
ASYNPLEX was implemented on a Cray T3D using modules of Hall's highly efficient revised simplex solver and tested using four modest but representative Netlib test problems. Using between 8 and 12 processors to perform simplex iterations, the iteration speed was increased by a factor of about 5 in all cases. However, the increase in the number of iterations required to solve the problem (resulting from the use of out-of-date reduced costs) led to the speed-up in solution time ranging from 2.4 to 4.8. Although the speed-up obtained was felt to be satisfactory, it was limited by various factors. A further consequence of limited candidate persistence was the time wasted on candidates which, upon completion of FTRAN, were found to be unattractive. Efficiency was also reduced by the overhead of communicating the pivotal column to all other iteration processors, together with the duplicated work of bringing the eta file up-to-date on all processors. Numerical instability, due to the need to re-use eta vectors from the previous representation of B^{-1}, was observed for many test problems.
Hall and McKinnon's (1996) second parallel scheme was PARSMI. This was developed in an attempt to address the deficiencies of ASYNPLEX (Hall and McKinnon 1998). In order to make realistic comparisons with good serial solvers, PARSMI updates the reduced costs and uses Devex pricing. To reduce the communication overheads, each processor is associated with one of the major computational components of the revised simplex method, with a variant of multiple pricing being used to increase the scope for parallelism. The operations performed on processors of different types and the inter-processor communications are illustrated in Fig. 6, where MI represents the processors performing minor iterations and MI_CT is the single processor coordinating the MI processors. The FTRAN processors receive sets of attractive candidates to enter the basis from the PRICE processors and apply the INVERT etas. The resulting partially completed tableau columns are distributed over the MI processors, where the update etas are applied. The MI_CT processor determines the best candidate for which the up-to-date tableau column is known and coordinates CHUZR, which is distributed over the MI processors. Once the pivotal row has been determined, the MI_CT processor applies the update etas to start BTRAN. This is done very efficiently using the technique described by
[Figure: diagram of the INVERT, BTRAN, PRICE, MI_CT and MI processors and the FTRAN processors, showing the vectors communicated between them (candidate columns, the partially formed pivotal row, reduced costs and the new factored inverse) and the order of magnitude of each message, from O(1) to O(10n).]
Fig. 6 Operations and inter-processor communications for PARSMI
[Figure: trace of real processor activity for the MI, MI_CT, FTRAN, PRICE, BTRAN and INVERT processors over a short time interval, with the communications between them. Key: C: CHUZC; F: I-FTRAN; U_FTR: U-FTRAN; R: CHUZR; SI: Send INVERT.]
Fig. 7 Real processor activity and illustrative communication for PARSMI
Hall and McKinnon (2005). The resulting partially formed vector π_p is then communicated to a BTRAN processor which applies the INVERT etas before distributing the completed vector π_p across the PRICE processors. After a distributed PRICE operation, the PRICE processors update the reduced costs and Devex weights before determining sets of candidates and communicating them to the FTRAN processors. One or more INVERT processors factorise the basis matrix serially and communicate the result to the FTRAN and BTRAN processors. Although presented sequentially, operations are overlapped as much as possible. For a six-processor implementation, the operation of the scheme is illustrated in Fig. 7. Note that this Gantt chart was created from actual data when applying the T3D implementation of PARSMI to the LP test problem 25FV47.
As with ASYNPLEX, the implementation of PARSMI was a highly complex programming task, magnified by the fact that communication times are not deterministic and the order of arrival of messages determined the operation of the program.
Programming difficulties, together with numerical instability (for the same reason as with ASYNPLEX), meant that the implementation was never reliable. In the very limited results that were obtained on modest Netlib problems, using eight processors the speed-up in iteration speed of between 2.2 and 2.4 was eroded by the consequences of limited candidate persistence to yield a speed-up in solution time of between 1.7 and 1.9.
5.2.4 The simplex method on supercomputers
From the results quoted by Beasley (1990) in a review of experience with the simplex method on Cray supercomputers, it may be concluded that these machines were used principally to study algorithmic performance when solving large LP problems, rather than to investigate the scope for exploiting their vector arithmetic facilities. When the latter was not used, the performance decreased by only a few tens of percent.
The most notable attempt to exploit a vector facility is reported by Forrest and Tomlin (1990). They vectorised (column-wise) PRICE by identifying those sets of columns of the constraint matrix for which the numbers of nonzeros are equal and at least 20, and storing the values and row indices in dense rectangular arrays. Using a set of moderate test problems, some of which were drawn from the Netlib set, they achieved speed-ups that are generally modest (less than 2), although for one problem the speed-up of 3.8 compares well with the figure of 4.5 which is the maximum attainable on the particular architecture. They did not consider the consequences of π being sparse, in which case their vectorised PRICE would involve many operations with zero. For a sparse π, a row-oriented PRICE avoids operations with zero but, since the numbers of nonzeros in N change from one iteration to another, it is hard to imagine their vectorisation technique being of value. Forrest and Tomlin go on to consider the extent to which a vector processor can be exploited in FTRAN, BTRAN and UPDATE. In their serial revised simplex implementation, the latter has a significant computational overhead since it modifies the factors obtained by INVERT but, typically, yields a significantly smaller eta file than would be obtained with the product form update. Estimated speed-ups of 1.5 and 2.0 for FTRAN, BTRAN and UPDATE are reported for two illustrative problems. As for other computational components, Forrest and Tomlin observe that when packing the pivotal column prior to CHUZR in easy problems (where the pivotal column contains a large proportion of zeros) much time is spent identifying nonzeros, and they note that this operation vectorises well. In modern revised simplex solvers, exploiting such hyper-sparsity during FTRAN leads to a list of indices of (potential) nonzeros being known, so the full packing operation is unnecessary. The overall effect of exploiting a vector processor is a general speed-up of up to 4, with a speed-up of 12 for one problem, relative to a (then) state-of-the-art revised simplex solver.
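The packing scheme behind the vectorised PRICE can be sketched briefly. Columns with an equal number of nonzeros are gathered into dense rectangular arrays of values and row indices, so the inner reduction runs with a fixed stride. This is a sketch of the idea, not of Forrest and Tomlin's code (they used a threshold of 20 nonzeros; 2 here keeps the example small, and columns below the threshold would be handled conventionally):

```python
import numpy as np
from collections import defaultdict

def pack_columns(N, min_count=2):
    """Group columns of N with an equal number of nonzeros, storing values
    and row indices as dense rectangular arrays, one pair per group."""
    groups = defaultdict(list)
    for j in range(N.shape[1]):
        idx = np.nonzero(N[:, j])[0]
        if idx.size >= min_count:
            groups[idx.size].append((j, idx, N[idx, j]))
    packed = {}
    for k, cols in groups.items():
        js = np.array([c[0] for c in cols])
        rows = np.vstack([c[1] for c in cols])   # k row indices per column
        vals = np.vstack([c[2] for c in cols])   # k values per column
        packed[k] = (js, rows, vals)
    return packed

def packed_price(packed, pi, n):
    """pi^T N restricted to the packed columns: each group reduces a pair
    of dense rectangular arrays, which vectorises with a fixed stride."""
    result = np.zeros(n)
    for js, rows, vals in packed.values():
        result[js] = (pi[rows] * vals).sum(axis=1)
    return result
```

The weakness noted above is visible here too: when π is sparse, most of the gathered products pi[rows] * vals are multiplications by zero.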
5.2.5 Following parallel paths
An alternative approach to exploiting parallel computation when using the simplex method is to allow each processor to follow different paths through the feasible region and, periodically, determine which processor has made the best progress and then repeat the process from the corresponding basis. The simplest possible scheme of
this kind would correspond to a parallelisation of the maximum improvement rule for column selection. This is a criterion in which CHUZR is performed for all attractive candidates to enter the basis and the basis change performed is that which yields the greatest reduction in the objective. Unless it is performed within the context of the standard simplex tableau, this is prohibitively expensive as a variant for serial computation. However, it parallelises naturally. A more practical approach that has been considered is to distribute the best candidates, one to each processor, and allow the simplex method to perform a fixed number of basis changes before determining the process that has made the best progress. Whilst this is simple to implement and has minimal communication overhead, it cannot be expected to yield worthwhile speed-up. One argument why is that, if the simplex method starts from a basis of logical variables and the optimal basis consists of structural variables, then the number of simplex iterations that must be performed is bounded below by m. Since efficient serial simplex solvers will, typically, determine the optimal basis in a small multiple of m iterations, this is an immediate bound on the speed-up that can be achieved. For some LP problems, typically those that are highly degenerate, the number of iterations required even by good solvers is a significant multiple of m. However, even for such problems, it is doubtful that the number of iterations required would be reduced sufficiently by any parallel path-following implementation.
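The maximum improvement rule described above can be sketched as follows, with the loop over attractive candidates being what parallelises naturally. This is an illustrative sketch with an invented name, working on a dense tableau for simplicity and assuming a feasible right-hand side:

```python
import numpy as np

def max_improvement_candidate(A, b, d):
    """For each attractive column q (reduced cost d[q] < 0), perform the
    CHUZR ratio test and return the candidate giving the greatest
    objective reduction |d[q]| * theta_q.
    A: current tableau columns, b: current (feasible) rhs."""
    best_q, best_gain = None, 0.0
    for q in np.where(d < 0)[0]:        # this loop parallelises naturally
        col = A[:, q]
        mask = col > 1e-12
        if not mask.any():
            continue                    # unbounded direction: no ratio test
        theta = np.min(b[mask] / col[mask])  # CHUZR ratio test
        gain = -d[q] * theta
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain
```

In serial, the cost of one ratio test per attractive candidate is what makes the rule prohibitive outside the standard simplex tableau; distributing the candidates across processors removes that serial bottleneck but not the lower bound on iteration counts discussed above.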
Another approach to following parallel paths has been investigated by Maros and Mitra (2000). Rather than using the same variant of the revised simplex method to follow paths, Maros and Mitra's strategy is to start from different edges at a common vertex and use eight different pricing strategies, including Dantzig, Devex, multiple pricing and several variants of partial pricing. They refer to this as a parallel mixed strategy. Although Devex pricing is commonly used as a default strategy, it is well known that for certain LP problems it is far from being optimal. This is particularly so for problems with large column/row ratios, and Maros and Mitra give evidence of this. Further, it is also widely acknowledged that, during the course of solving a given LP problem, the best pricing strategy is likely to change. Early in the solution process, when the basis matrix contains a high proportion of logical variables, Devex may be significantly less effective than partial or multiple pricing. Many efficient serial simplex solvers have a run-time pricing strategy that first decides whether the column/row ratio is such that partial pricing is expected to be more efficient and, if not, starts with some cheap pricing strategy before switching to a full pricing strategy such as Devex.
Maros and Mitra implemented their mixed strategy on a twelve-processor Parsytec CC-12, using an extension of their efficient FortMP simplex solver, and tested it on fifteen of the largest Netlib problems. Note that the default pricing strategy for FortMP is dynamic cyclic partial multiple pricing. The mixed strategy is generally more efficient than the best single serial strategy (by an average factor of 1.19), and always more efficient than both the FortMP default strategy (by an average factor of 1.69) and Devex (by an average factor of 1.56). Whilst the mixed strategy was more than three times faster than both the default strategy and Devex (for a different single LP problem in both cases), the average speed-up is modest. It would be interesting to compare the performance of the mixed strategy with a serial run-time mixed strategy such as that described above.
5.3 Parallelising the network simplex method
When the simplex method is applied to a minimum cost network flow problem, exploiting its structure yields the network simplex method. Although it exploits sparsity and structure very efficiently, the data structures and algorithmic requirements for implementing the network simplex method are much simpler than those for implementing the revised simplex method for general LP problems. The parallelisation of the network simplex method has been considered by several authors, including Chang et al. (1988), Peters (1990) and Barr and Hickman (1994). The latter two are of interest since they identify the value of overlapping computational components, subject to sufficient coordination to ensure that the simplex method follows a single path through the feasible region. As such they predate the task-parallel approaches to the revised simplex method of Hall and McKinnon (1996, 1998) and Wunderling (1996a,b).
5.4 Parallelising extensions of the simplex method
The parallelisation of other simplex-like methods for solving LP problems has also been studied. These techniques offer more immediate scope for parallelisation but, in serial, are generally uncompetitive with a good implementation of the revised simplex method. They are also frequently only of relevance to LP problems which possess a particular, but common, structure. This is either inherent, as a consequence of the underlying model, or identified by analysing the constraint matrix.
In the technique of Dantzig–Wolfe decomposition (Dantzig 1960) for block-angular problems with linking rows, the requirement to solve a (possibly large) number of independent smaller LP subproblems is a natural source of parallelism. This was identified by Ho et al. (1988) and load-balancing issues were addressed by Gnanendran and Ho (1993). The solution of problems with block-angular structure using parallel bundle methods is described by Medhi (1990).
For block-angular LP problems with linking columns, Benders (1962) decomposition may be used. The requirement to solve (nested) sets of independent LP problems has been exploited in a number of parallel implementations, for example Dempster and Thompson (1998), Hiller and Eckstein (1993) and Nielsen and Zenios (1997). Rosen and Maier (1990) describe a method for block-angular problems with linking rows in which, after forming the dual problem, a procedure similar to Benders decomposition yields a set of independent LP subproblems to be solved in parallel.
Work on identifying block-angular structure in general LP problems was reported by Pinar and Aykanat (1996) and more recent work is reported by Aykanat et al. (2004). Ferris and Horn (1998) discuss how block-angular structure may be identified in general LP problems prior to applying parallel bundle methods, for which they achieved an efficiency of approximately 90% when using 32 processors of a Thinking Machines CM-5.
Klabjan et al. (2000) describe a primal-dual simplex algorithm in which several primal subproblems are solved in parallel. Their algorithm is only efficient for problems with very high column/row ratios and, for such problems, they achieve a speed-up of up to 4 on a cluster of 16 PCs.
5.4.1 Parallelising Kaul's method
Boduroğlu (1997) identified that, for general block-angular LP problems with linking rows, Kaul's (1965) method appears to offer an attractive means of developing a parallel solver. It exploits the bordered block diagonal structure of the basis matrix
B = [ B_00  B_01
      B_10  B_11 ],   (5.1)
performing operations with B^{-1} by maintaining invertible representations of the diagonal blocks of B_11, the matrix B_11^{-1} B_10, as well as the remaining entries of the standard simplex tableau corresponding to linking rows. In the simplest implementation, explicit inverses of the diagonal blocks of B_11 are maintained and the structure of B_11^{-1} B_10 is not exploited, allowing all the necessary values to be stored in dense arrays and updated using (at most) one outer product per iteration. As in the revised simplex method, the pivotal row is computed from π_p, using a matrix–vector product that exploits the block structure of the problem. What distinguishes Kaul's method from the standard simplex method is that at least one dimension of each dense array is small relative to the size of the whole LP problem. Thus the memory overhead and computational requirement per iteration are vastly less.
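To make the role of these representations concrete, the following sketch (invented names; an illustration, not Boduroğlu's implementation) solves a system with a basis matrix of the form (5.1), given explicit inverses of the diagonal blocks of B_11, via the Schur complement of the linking rows:

```python
import numpy as np

def apply_block_inv(blocks_inv, v):
    """Apply the block diagonal inverse of B_11 to a vector, block by block."""
    out, start = [], 0
    for Binv in blocks_inv:
        k = Binv.shape[0]
        out.append(Binv @ v[start:start + k])
        start += k
    return np.concatenate(out)

def block_solve(B00, B01, B10, blocks_inv, b0, b1):
    """Solve B x = b for B as in (5.1), using the Schur complement
    S = B00 - B01 B11^{-1} B10, whose dimension is the (small)
    number of linking rows."""
    # W = B11^{-1} B10, applied column by column
    W = np.column_stack([apply_block_inv(blocks_inv, B10[:, j])
                         for j in range(B10.shape[1])])
    S = B00 - B01 @ W
    y1 = apply_block_inv(blocks_inv, b1)          # B11^{-1} b1
    x0 = np.linalg.solve(S, b0 - B01 @ y1)        # linking-row unknowns
    x1 = y1 - W @ x0                              # block unknowns
    return np.concatenate([x0, x1])
```

Note how every dense object (W, S, the linking-row slices) has at least one small dimension, which is the source of the low per-iteration cost claimed above.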
A further attraction of Kaul's method is that efficient pricing techniques are readily incorporated. Given his successful use of steepest edge (Goldfarb and Reid 1977) within his dense standard simplex solver (Eckstein et al. 1995), Boduroğlu (1997) applied the approximate steepest edge technique of Crowder and Hattingh (1975) in computing weights by using the tableau slice corresponding to linking rows. However, any other edge weight pricing technique for the revised simplex method could be used. The scope for using such techniques sets Kaul's method apart from Dantzig–Wolfe decomposition, as does the fact that Dantzig–Wolfe requires additional variables and may take considerably more iterations than Kaul's method.
Boduroğlu (1997) implemented Kaul's method for the special case of generalised upper-bound (GUB) LP problems, where each block has one row. Using LP problems for which the proportion of GUB constraints ranged from 6 to 96%, and running on a 32-node Connection Machine CM-5E, his implementation was faster than the Minos 5.4 serial sparse revised simplex solver.
5.5 Summary
Attempts to exploit parallelism in the simplex method have been associated with many of the leading figures in the development of efficient serial techniques for the revised simplex method. That the success relative to good serial implementations has been limited is a measure of the difficulty of the task. None of these attempts has, for general large sparse LP problems, offered significantly improved performance over a good serial implementation of the revised simplex method. Parallelising the standard or revised simplex method with dense matrix algebra is uncompetitive unless a massive number of processors is used. With sparse matrix algebra, parallelising just
the PRICE operation, even for LP problems with a high column/row ratio, may still be uncompetitive with the revised simplex method if the latter uses partial pricing. For problems with a high column/row ratio, parallelising PRICE yields a worthwhile performance improvement if the dual simplex method is preferred to the revised simplex method with partial pricing. The success of attempts to exploit task parallelism has been limited variously by lack of candidate persistence, numerical instability and the communication overheads of distributed memory multiprocessors.
6 Directions for the future
The period when most of the work on parallelising the simplex method took place ended in the late 1990s. Since then, a number of developments have taken place that affect the scope for future work in this area and its expected value.
The identification and exploitation of hyper-sparsity within the revised simplex method has led to huge improvements in the performance of serial revised simplex solvers. For hyper-sparse LP problems, the revised simplex method is preferable to interior point methods, so any parallel implementation that offers significant speed-up over a good serial solver will yield advances at the cutting edge of LP solution techniques.
An important area of computational linear algebra that has seen much activity over the past decade or so has been parallel solution techniques for general unsymmetric sparse linear systems. This has taken place since most of the work on parallelising the simplex method, so it is unsurprising that the direction of pure data-parallel sparse simplex was not considered. For problems with a significant bump during INVERT, the lack of techniques for parallelising its factorisation had led to INVERT being viewed as inherently serial. This is no longer the case. Similarly, techniques for exploiting parallelism when applying a sparse factored inverse, even with a sparse right-hand side, have also been developed and it can be expected that they will facilitate the parallelisation of FTRAN and BTRAN.
Since the late 1990s, parallel computing hardware has developed greatly. There have been advances in the availability of shared memory multiprocessors, from multi-core PCs to large high performance computing resources. Massive distributed memory machines have ever larger numbers of processors and total memory.
Against the backdrop of these events, there have been no signicant advances in parallelising the simplex method. The scope for exploiting these developments, as well as other novel research avenues, is explored below.
6.1 The standard simplex method
Although the parallelisation of the standard simplex method with dense matrix algebra has been fully explored on modest numbers of processors and has been implemented successfully using large numbers of processors, exploiting modern massively parallel resources offers several major challenges. Implementations must be numerically stable, highly scalable and robust with respect to hardware faults. Techniques used by efficient serial simplex solvers to reduce the number of iterations required to solve a
given LP problem should also be incorporated (as has been done in several previous implementations). For several reasons, developing a massively parallel implementation of the standard simplex method with these properties and using dense matrix algebra would be worthwhile as an exercise in parallel programming. However, as established in the analysis below, for most practical LP problems, alternative methods of solution would be very much more efficient.
For large LP test problems (with mn ≈ 10^11), to store the standard simplex tableau as a full array would require of the order of 1 TB of memory. This is available on many of the world's current top 500 computer systems and, at their maximum practical performance, few of them would be capable of performing 100 iterations per second if each tableau entry were updated every iteration. Since the number of iterations required to solve LP problems of this size would be of order 10^5, the total solution time for most large sparse test problems would be no better than a good serial revised simplex implementation. Very few of the world's top 500 computer systems would be able to store the tableau as a full array for problems with ten times as many variables and constraints in order to solve LP problems that, if sparse, could still be expected to be solved by a good serial solver. Thus a massively parallel implementation of the standard simplex method would not be expected to offer the opportunity to solve larger sparse problems than could be solved by a serial revised simplex solver.
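The 1 TB figure can be checked with a line of arithmetic, assuming the tableau entries are stored as 8-byte double-precision values:

```python
# Back-of-envelope check of the memory estimate quoted above:
# a dense tableau with m*n ~ 10^11 entries at 8 bytes per double.
entries = 10**11
terabytes = entries * 8 / 2**40
print(round(terabytes, 2))  # about 0.73 TB, i.e. of the order of 1 TB
```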
There are a few classes of slightly smaller sparse LP problems for which fill-in makes the memory requirement of serial revised simplex and interior point solvers prohibitive. For these problems, as well as large dense LP problems, a robust and numerically stable parallel implementation of the standard simplex method would be of value. Note that a dense interior point solver would have a practical computational cost of O((m^3 + m^2 n) log n), so cannot be expected to compete with the O(m^2 n + m n^2) cost of the dense standard simplex method unless n ≫ m.
For hyper-sparse LP problems, the proportion of nonzeros in standard simplex tableau columns ranges from an average of 10% to several orders of magnitude less. If the tableau update does not test whether the multiplier of the pivotal row is zero, then an overwhelming majority of the floating-point operations add zero so should be avoided. While skipping the update if the multiplier is zero leads to an immediate and significant serial performance improvement, for a parallel implementation it may have an impact on scalability. An investigation of this and its impact on data distribution strategies would be interesting.
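The zero-skipping tableau update discussed above can be sketched as follows (an illustration with an invented name; it uses an exact zero test where a practical code would use a tolerance, and updates the whole tableau row for simplicity):

```python
import numpy as np

def update_tableau(T, p, q):
    """Gauss-Jordan update of tableau T with pivot in row p, column q.
    Rows whose multiplier in the pivotal column is zero are skipped,
    avoiding updates that would only add zero to every entry."""
    piv = T[p, q]
    T[p, :] /= piv                       # normalise the pivotal row
    for i in range(T.shape[0]):
        if i != p and T[i, q] != 0.0:    # skip rows with zero multiplier
            T[i, :] -= T[i, q] * T[p, :]
    return T
```

In serial this skip is a pure saving; in a parallel implementation the rows needing work vary unpredictably from iteration to iteration, which is the scalability and data-distribution question raised above.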
As identified earlier, the work of Lentini et al. (1995) represents the only study of the parallel sparse standard simplex method. For the LP problems they considered, the average density of the optimal tableau was found to be about 10%. For large problems, such a density would make the serial sparse standard simplex method prohibitively expensive and only with a massive number of parallel processes could it conceivably compete with a good serial revised simplex solver. For large LP problems, even if they exhibit a very high degree of hyper-sparsity, the average tableau density still corresponds to a huge memory overhead. However, developing a massively parallel implementation of the sparse standard simplex method would represent a worthwhile challenge and would facilitate further study of hyper-sparsity.
A massively parallel implementation of the standard simplex method would have to be robust with respect to hardware faults. This could be achieved using the techniques described in Sect. 2.1 to address numerical stability.
6.2 Kaul's method for block-angular LP problems
The observations of Boduroğlu (1997) on Kaul's method provide a platform for the development of a scalable parallel solver for large block-angular LP problems with linking rows. This is the subject of current research by Boduroğlu, Hall and Hogg.
Since the approximate steepest edge strategy used by Boduroğlu (1997) is based on exact norms of only the entries in the linking rows, these may not be sufficiently good measures to yield the numerical stability and reduction in the number of iterations that was achieved by Boduroğlu's dense tableau simplex solver (Eckstein et al. 1995). It may be preferable to use Devex (Harris 1973) or, if necessary for numerical stability, accept the significant additional computation and communication overhead of exact steepest edge by implementing the updates described by Goldfarb and Reid (1977). Although the diagonal blocks of the basis matrix in Kaul's method are updated so that they remain nonsingular, pivoting with columns in the left border on grounds of numerical stability is not discussed. This was not an issue within Boduroğlu's GUB LP solver (Boduroğlu 1997), but may limit the numerical stability of a general solver.
When compared with the requirements of the standard simplex method, Kaul's method is very much more efficient in terms of memory overhead and computational cost. However, in order to solve block-angular LP problems with larger numbers of linking rows and smaller numbers of sparse diagonal blocks, it may be preferable on grounds of greater computational efficiency to store only the entries of B^{-1} (rather than the whole of the tableau) corresponding to linking rows, and to represent the inverses of the diagonal blocks of the basis matrix (5.1) by exploiting their structure/sparsity. Indeed, to operate with B^{-1}, it is sufficient to maintain an invertible representation of the Schur complement B_00 - B_01 B_11^{-1} B_10, at the cost of an additional operation with B_11^{-1}. Note that if B^{-1} is represented but none of the standard simplex tableau, the solution procedure reduces to a structure-exploiting variant of the revised simplex method and loses any Kaul-ness. The reductions in memory requirement achieved in this way may also be necessary if such large, sparse problems are to be solved using a practical number of processors.
6.3 Revised simplex with a factored inverse
To date, the brave but unsuccessful work by Shu (1995) represents the only real attempt to develop anything like a fully data-parallel implementation of the revised simplex method using a sparse factored inverse. Following the major advances since this time, the state of the art in parallel techniques for the factorisation of general sparse matrices is represented by, for example, SuperLU from Berkeley (Demmel et al. 1999; Li and Demmel 2003) and MUMPS from Toulouse–Lyon (Amestoy et al. 2000, 2001, 2006). Many features distinguish the factorisation and linear system solution requirements in the revised simplex method. The fact that the matrix being factored is often highly
reducible can be exploited in many ways. It has been shown by Amestoy et al. (2007a,b) that general preprocessing and reordering techniques can combine numerical accuracy and reducibility detection in the context of general sparse solvers. For hyper-sparse LP problems, the overwhelming cost of INVERT is the triangularisation phase and results reported by Hall (2007) for a parallel triangularisation scheme are very encouraging. With the fruits of these advances as yet untapped in the context of the revised simplex method, the time is right for more work in this area.
The task-parallel scheme SYNPLEX (Hall 2005) is a variant of PARSMI developed by Hall that is numerically stable. This has been achieved by overlapping INVERT with multiple BTRAN, multiple PRICE, CHUZC and multiple FTRAN. SYNPLEX has the added advantage of being algorithmically independent of both the number of processors and the time required to transmit data between them. A prototype implementation by Hall on a shared memory Sun Fire E15k achieved a speed-up of between 1.8 and 3.0 when using eight processors. The limitations of ASYNPLEX and PARSMI due to lack of candidate persistence have been reduced but cannot be eliminated. Load balancing problems when overlapping INVERT have motivated, and should be alleviated by, the development of a parallel INVERT or the incorporation of a general asymmetric parallel factorisation routine.
7 Conclusions
Despite its value in both the academic and commercial worlds, there has been no parallelisation of the simplex method that offers significantly improved performance over a good serial implementation of the revised simplex method for general large sparse LP problems. The successes in terms of speed-up typically correspond to parallelisation of implementations that are partly or wholly inefficient in serial.
Attention has mainly been directed at inefficient simplex and simplex-like techniques in which exploiting parallelism is relatively straightforward. There have been only a few discussions of schemes by which parallelism might be exploited by (variants of) the revised simplex method with a factored inverse. Fewer still of these schemes have been implemented and none has met with any real success when applied to a full range of LP problems.
Despite the large amount of work reviewed in Sect. 5, the reference list in this paper is not exhaustive. Articles that stand out as best representing the state of the art are those by Bixby and Martin (2000), Eckstein et al. (1995), Forrest and Tomlin (1990), Lentini et al. (1995) and Shu (1995). At the suggestion of one of the referees of this paper, the work of Hall and McKinnon (1996, 1998) should be added to this list which, together with this paper, should be viewed as required reading for anyone contemplating the exploitation of parallelism in the simplex method.
In Sect. 6, several promising avenues for future work are identified. Without underestimating the challenge of implementing them, the rewards for any success are now all the greater since the identification of hyper-sparsity has given the revised simplex method renewed prominence. The demand for efficient simplex solvers capable of exploiting the increasingly widespread multi-core PC architectures adds further impetus to research in this area. The time is ripe for further work on developing parallel simplex method implementations of practical value.
Acknowledgments The author would like to thank Patrick Amestoy for his insight into the developments in techniques for parallel factorisation and solution techniques for general unsymmetric sparse linear systems, and the potential for their application to the revised simplex method. Comments and suggestions from the referees have also been very valuable and have improved both the content and focus of the paper.
References
Agrawal A, Blelloch GE, Krawitz RL, Phillips CA (1989) Four vector-matrix primitives. In: ACM symposium on parallel algorithms and architectures, pp 292–302
Amdahl GM (1967) Validity of the single-processor approach to achieving large scale computing capabilities. In: AFIPS conference proceedings, vol 30. AFIPS Press, Reston, pp 483–485
Amestoy PR, Duff IS, L'Excellent J-Y (2000) Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput Methods Appl Mech Eng 184:501–520
Amestoy PR, Duff IS, Koster J, L'Excellent J-Y (2001) A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J Matrix Anal Appl 23(1):15–41
Amestoy PR, Guermouche A, L'Excellent J-Y, Pralet S (2006) Hybrid scheduling for the parallel solution of linear systems. Parallel Comput 32(2):136–156
Amestoy P, Li XS, Ng EG (2007a) Diagonal Markowitz scheme with local symmetrization. SIAM J Matrix Anal Appl 29(1):228–244
Amestoy P, Pralet S, Li XS (2007b) Unsymmetric orderings using a constrained Markowitz scheme. SIAM J Matrix Anal Appl 29(1):302–327
Aykanat C, Pinar A, Çatalyürek ÜV (2004) Permuting sparse rectangular matrices into block-diagonal form. SIAM J Sci Comput 25(6):1860–1879
Babaev DA, Mardanov SS (1991) A parallel algorithm for solving linear programming problems. Zh Vychislitel'noi Matematiki i Matematicheskoi Fiziki 31(1):86–95
Badr E-S, Moussa M, Papparrizos K, Samaras N, Sifaleras A (2006) Some computational results on MPI parallel implementations of dense simplex method. Trans Eng Comput Technol 17:228–231
Barr RS, Hickman BL (1994) Parallel simplex for large pure network problems: computational testing and sources of speedup. Oper Res 42(1):65–80
Beasley JE (1990) Linear programming on Cray supercomputers. J Oper Res Soc 41(2):133–139
Benders JF (1962) Partitioning procedures for solving mixed-variables programming problems. Numer Math 4:238–252
Bixby RE (2002) Solving real-world linear programs: a decade and more of progress. Oper Res 50(1):3–15
Bixby RE, Martin A (2000) Parallelizing the dual simplex method. INFORMS J Comput 12:45–56
Bixby RE, Fenelon M, Gu Z, Rothberg E, Wunderling R (2000) MIP: theory and practice – closing the gap. In: Powell MJD, Scholtes S (eds) System modelling and optimization: methods, theory and applications. Kluwer, The Netherlands, pp 19–49
Boduroğlu İİ (1997) Scalable massively parallel simplex algorithms for block-structured linear programs. Ph.D. thesis, GSAS, Columbia University, New York
Boffey TB, Hay R (1989) Implementing parallel simplex algorithms. In: CONPAR 88. Cambridge University Press, Cambridge, pp 169–176
Carolan WJ, Hill JE, Kennington JL, Niemi S, Wichmann SJ (1990) An empirical evaluation of the KORBX algorithms for military airlift applications. Oper Res 38(2):240–248
Chang MD, Engquist M, Finkel R, Meyer RR (1988) A parallel algorithm for generalized networks. Ann Oper Res 14:125–145
Crowder H, Hattingh JM (1975) Partially normalized pivot selection in linear programming. Math Program Study 4:12–25
Cvetanovic Z, Freedman EG, Nofsinger C (1991) Efficient decomposition and performance of parallel PDE, FFT, Monte-Carlo simulations, simplex, and sparse solvers. J Supercomput 5:19–38
Dantzig GB (1960) The decomposition principle for linear programs. Oper Res 8:101–111
Dantzig GB, Orchard-Hays W (1954) The product form for the inverse in the simplex method. Math Comput 8:64–67
Demmel JW, Eisenstat SC, Gilbert JR, Li XS, Liu JWH (1999) A supernodal approach to sparse partial pivoting. SIAM J Matrix Anal Appl 20(3):720–755
Dempster MAH, Thompson RT (1998) Parallelization and aggregation of nested Benders decomposition. Ann Oper Res 81:163–187
Eckstein J, Boduroğlu İİ, Polymenakos L, Goldfarb D (1995) Data-parallel implementations of dense simplex methods on the Connection Machine CM-2. ORSA J Comput 7(4):402–416
Evans DJ, Hatzopoulos M (1979) A parallel linear system solver. Int J Comput Math 7:227–238
Ferris MC, Horn JD (1998) Partitioning mathematical programs for parallel solution. Math Program 80:35–61
Finkel RA (1987) Large-grain parallelism: three case studies. In: Jamieson LH, Gannon D, Douglas RJ (eds) The characteristics of parallel algorithms. MIT Press, Cambridge, pp 21–63
Forrest JJ, Goldfarb D (1992) Steepest-edge simplex algorithms for linear programming. Math Program 57:341–374
Forrest JJH, Tomlin JA (1990) Vector processing in the simplex and interior methods for linear programming. Ann Oper Res 22:71–100
Gay DM (1985) Electronic mail distribution of linear programming test problems. Math Program Soc COAL Newsl 13:10–12
Gill PE, Murray W, Saunders MA, Wright MH (1989) A practical anti-cycling procedure for linearly constrained optimization. Math Program 45:437–474
Gnanendran SK, Ho JK (1993) Load balancing in the parallel optimization of block-angular linear programs. Math Program 62:41–67
Goldfarb D, Reid JK (1977) A practical steepest-edge simplex algorithm. Math Program 12:361–371
Hall JAJ (2005) SYNPLEX: a task-parallel scheme for the revised simplex method. In: Contributed talk at the second international workshop on combinatorial scientific computing (CSC05), June 2005
Hall JAJ (2007) Parallel basis matrix triangularisation for hyper-sparse LP problems. In: Contributed talk at the IMA conference on numerical linear algebra and optimisation, September 2007
Hall JAJ, McKinnon KIM (1992) Update procedures for the parallel revised simplex method. Technical report MSR 92-13, Department of Mathematics and Statistics, University of Edinburgh
Hall JAJ, McKinnon KIM (1996) PARSMI: a parallel revised simplex algorithm incorporating minor iterations and Devex pricing. In: Waniewski J, Dongarra J, Madsen K, Olesen D (eds) Applied parallel computing. Lecture notes in computer science, vol 1184. Springer, Berlin, pp 67–76
Hall JAJ, McKinnon KIM (1998) ASYNPLEX: an asynchronous parallel revised simplex method algorithm. Ann Oper Res 81:27–49
Hall JAJ, McKinnon KIM (1999) Exploiting hyper-sparsity in the revised simplex method. Technical report MS99-014, Department of Mathematics and Statistics, University of Edinburgh
Hall JAJ, McKinnon KIM (2005) Hyper-sparsity in the revised simplex method and how to exploit it. Comput Optim Appl 32(3):259–283
Harris PMJ (1973) Pivot selection methods of the Devex LP code. Math Program 5:1–28
Helgason RV, Kennington LJ, Zaki HA (1988) A parallelisation of the simplex method. Ann Oper Res 14:17–40
Hiller RS, Eckstein J (1993) Stochastic dedication: designing fixed income portfolios using massively parallel Benders decomposition. Manage Sci 39(11):1422–1438
Ho JK, Lee TC, Sundarraj RP (1988) Decomposition of linear programs using parallel computation. Math Program 42:391–405
Ho JK, Sundarraj RP (1994) On the efficacy of distributed simplex algorithms for linear programming. Comput Optim Appl 3(4):349–363
Kaul RN (1965) An extension of generalized upper bounding techniques for linear programming. Technical report ORC 65-27, Operations Research Center, University of California, Berkeley
Klabjan D, Johnson EL, Nemhauser GL (2000) A parallel primal-dual simplex algorithm. Oper Res Lett 27:47–55
Kumar V, Grama A, Gupta A, Karypis G (2003) Introduction to parallel computing: design and analysis of algorithms, 2nd edn. Addison-Wesley, New York
Lentini M, Reinoza A, Teruel A, Guillen A (1995) SIMPAR: a parallel sparse simplex. Comput Appl Math 14(1):49–58
Li XS, Demmel JW (2003) SuperLU_DIST: a scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Trans Math Softw 29(2):110–140
Luo J, Reijns GL (1992) Linear programming on transputers. In: van Leeuwen J (ed) Algorithms, software, architecture. IFIP transactions A (computer science and technology). Elsevier, Amsterdam, pp 525–534
Luo J, Reijns GL, Bruggeman F, Lindfield GR (1991) A survey of parallel algorithms for linear programming. In: Deprettere EF, van der Veen A-J (eds) Algorithms and parallel VLSI architectures, vol B. Elsevier, Amsterdam, pp 485–490
Markowitz H (1957) The elimination form of the inverse and its application to linear programming. ManageSci 3:255296
Maros I, Mitra G (2000) Investigating the sparse simplex algorithm on a distributed memory multiprocessor.Parallel Comput 26:151170
McKinnon KIM, Plab F (1997a) A modified Markowitz criterion to increase parallelism in inverse factors of sparse matrices. Technical report, Department of Mathematics and Statistics, University of Edinburgh
McKinnon KIM, Plab F (1997b) An upper bound on parallelism in the forward transformation within the revised simplex method. Technical report, Department of Mathematics and Statistics, University of Edinburgh
Medhi D (1990) Parallel bundle-based decomposition algorithm for large-scale structured mathematical programming problems. Ann Oper Res 22:101–127
Mittelmann HD (2008) Benchmarks for optimization software. http://plato.asu.edu/bench.html. Accessed February 2008
Nielsen SN, Zenios SA (1997) Scalable Benders decomposition for stochastic linear programming. Parallel Comput 23:1069–1088
Orchard-Hays W (1968) Advanced linear programming computing techniques. McGraw-Hill, New York
Peters J (1990) The network simplex method on a multiprocessor. Networks 20:845–859
Pfefferkorn CE, Tomlin JA (1976) Design of a linear programming system for the ILLIAC IV. Technical report SOL 76-8, Systems Optimization Laboratory, Stanford University
Pinar A, Aykanat C (1996) An effective model to decompose linear programs for parallel solution. In: Waśniewski J, Dongarra J, Madsen K, Olesen D (eds) Applied parallel computing. Lecture notes in computer science, vol 1184. Springer, Heidelberg, pp 592–601
Rosen JB, Maier RS (1990) Parallel solution of large-scale, block-angular linear programs. Ann Oper Res 22:23–41
Shu W (1995) Parallel implementation of a sparse simplex algorithm on MIMD distributed memory computers. J Parallel Distrib Comput 31(1):25–40
Stunkel CB (1988) Linear optimization via message-based parallel processing. In: International conference on parallel processing, vol III, August 1988, pp 264–271
Suhl UH, Suhl LM (1990) Computing sparse LU factorizations for large-scale linear programming bases. ORSA J Comput 2(4):325–335
Thomadakis ME, Liu J-C (1996) An efficient steepest-edge simplex algorithm for SIMD computers. In: International conference on supercomputing, pp 286–293
Tomlin JA (1972) Pivoting for size and sparsity in linear programming inversion routines. J Inst Maths Applics 10:289–295
Wunderling R (1996a) Parallelizing the simplex algorithm. ILAY workshop on linear algebra in optimization, Albi
Wunderling R (1996b) Paralleler und objektorientierter Simplex. Technical report TR-96-09, Konrad-Zuse-Zentrum für Informationstechnik Berlin
Yarmish G (2001) A distributed implementation of the simplex method. Ph.D. thesis, Polytechnic University, Brooklyn, NY, March 2001
Zenios SA (1989) Parallel numerical optimization: current status and annotated bibliography. ORSA J Comput 1(1):20–43