Int J Parallel Prog (2009) 37:91-105 DOI 10.1007/s10766-008-0084-3
A Computational Science IDE for HPC Systems: Design and Applications
David E. Hudak · Neil Ludban · Ashok Krishnamurthy · Vijay Gadepally · Siddharth Samsi · John Nehrbass
Received: 7 March 2008 / Accepted: 13 August 2008 / Published online: 24 September 2008
© Springer Science+Business Media, LLC 2008
Abstract Software engineering studies have shown that programmer productivity is improved through the use of computational science integrated development environments (or CSIDE, pronounced "sea side") such as MATLAB. Scientists often desire to use high-performance computing (HPC) systems to run their existing CSIDE scripts with large data sets. ParaM is a CSIDE distribution that provides parallel execution of MATLAB scripts on HPC systems at large shared computer centers. ParaM runs on a range of processor architectures (e.g., x86, x64, Itanium, PowerPC) and its MPI binding, known as bcMPI, supports a number of interconnect architectures (e.g., Myrinet and InfiniBand). On a cluster at the Ohio Supercomputer Center, bcMPI with blocking communication has achieved 60% of the bandwidth of an equivalent C/MPI benchmark. In this paper, we describe goals and status for the ParaM project and the development of applications in signal and image processing that use ParaM.
D. E. Hudak (B) · N. Ludban · A. Krishnamurthy · V. Gadepally · S. Samsi
Ohio Supercomputer Center, 1224 Kinnear Road, Columbus, OH 43212, USA
e-mail: [email protected]
N. Ludban
e-mail: [email protected]
A. Krishnamurthy
e-mail: [email protected]
V. Gadepally
e-mail: [email protected]
S. Samsi
e-mail: [email protected]
J. Nehrbass
High Performance Technologies, Inc., 7765 Parkcreek Drive, Centerville, OH 45459, USA
e-mail: [email protected]
Keywords High-level language · High performance computing · MPI · PGAS · MATLAB
1 Introduction
The adoption of traditional computer science tools (i.e., C and FORTRAN programming with compilers such as gcc, UNIX debuggers such as gdb and profilers such as gprof) has been problematic for computational scientists. Mastering the complexity of these tools is often seen as a diversion of energy that could be applied to the study of the given scientific domain. Many computational scientists instead prefer computational science integrated development environments (or CSIDE, pronounced "sea side"), such as MATLAB, Maple or Mathematica.
The popularity of CSIDEs in desktop environments has led to the creation of many scientific solutions (coded as sets of CSIDE commands commonly called scripts) and, more importantly, the training of many computational scientists in their use. It would be a great boon for computational science if those solutions and that expertise could be applied to the largest computational systems, also known as high performance computing (HPC) systems.
At the Ohio Supercomputer Center (OSC), we have developed a CSIDE distribution for HPC systems called ParaM. ParaM has been installed on a number of systems at OSC and other large shared computer centers. In this paper, we recap the advantages of CSIDEs for computational science, describe the design and performance of ParaM, and present representative signal and image processing applications built with ParaM.
2 Computational Science IDE (CSIDE)
A computational science integrated development environment (abbreviated as CSIDE and pronounced "sea side") is defined as a suite of software tools, including (1) a numeric interpreter with high-level mathematics and matrix operations, (2) support for domain-specific extensions (e.g., signal and image processing, control systems, operations research), (3) graphics and visualization capabilities, and (4) a common user interface (typically including an editor) for interacting with the various tools. Commercial examples of CSIDEs include MATLAB, Maple and Mathematica. Notable open-source examples exist online, including GNU Octave (extended with OctaveForge and GNUPlot), Scilab, and Python (extended with SciPy, NumPy, IPython and Matplotlib).
CSIDEs are used as alternatives to traditional programming languages such as FORTRAN and C. Advantages of using a CSIDE have been enumerated for educational purposes [12,13]. These advantages include the interactive nature of the interpreter and the tight integration between the various tools in the CSIDE. Specifically, interactivity provides intuitive debugging and incremental development support as well as run-time inspection of complex data that may need sophisticated analysis (such as visualization) for validation. Tight integration between CSIDE tools removes the burden on the computational scientist to configure, install and use a number of software packages and to manage their potentially complex interactions.
3 Extending CSIDEs to HPC Systems
As CSIDEs have spread in popularity, they have been applied to data sets of increasing scale. In order to handle the necessary data processing requirements, users have a number of options including running on faster hardware, recoding their scripts in C or FORTRAN, or parallelizing their CSIDE scripts. In order to preserve their existing development work and to stay within a familiar development environment, some users have written CSIDE scripts designed to work in parallel on a single data set. In order to support such parallel scripts, parallel interpreters have been developed. These parallel interpreters are typically composed of (1) a sequential interpreter from a CSIDE, (2) a job control mechanism for launching multiple copies of the interpreter on multiple processors, and (3) communication libraries for interpreters to exchange results. Software packages that provide parallel interpretation are informally referred to here as HPC CSIDEs. While not providing comprehensive CSIDE functionality on an HPC system, these parallel interpreters allow users to port scripts developed within a CSIDE to parallel execution environments such as HPC systems.
3.1 HPC CSIDE Examples
For the remainder of this paper, we will restrict our discussions of HPC CSIDEs to ones executing scripts similar to MATLAB. There are a large number of both commercial and open source packages available to provide parallel interpretation for MATLAB scripts. For example, twenty-seven separate parallel interpreter systems based on MATLAB have been catalogued [5] as of the time of this writing, in addition to our own project, ParaM. However, many of these early MPI toolboxes for MATLAB written in C were tied to specific operating systems, specific versions of MPI or specific versions of MATLAB. In addition, many are no longer maintained. Regardless, these projects have influenced commercial projects like Interactive Supercomputing's Star-P and the MathWorks' Distributed Computing Toolbox (DCT) and Distributed Computing Engine (DCE).
MatlabMPI [7] and pMatlab [3] are popular open source packages for parallel interpretation of MATLAB scripts. MatlabMPI supports a message passing programming model for MATLAB scripts. pMatlab supports a partitioned global address space (PGAS) programming model similar to those of Titanium [15], Unified Parallel C (UPC) [11] and Co-Array Fortran [10]. MatlabMPI and pMatlab provide an application programming interface (API) for message passing and PGAS constructs for native MATLAB data types. MatlabMPI and pMatlab are composed entirely of MATLAB scripts and can execute anywhere MATLAB does, including Microsoft Windows environments. However, MatlabMPI's message passing constructs are implemented via file reads and writes to a globally shared file system and incur a high latency for messages. In addition, MatlabMPI's job management functions are not integrated with the batch system environments (e.g., PBS and LSF) commonly found at large shared computer centers.
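For illustration, the message passing style shared by MatlabMPI and bcMPI can be sketched as follows. This is a minimal sketch only: it assumes the MatlabMPI-style calling conventions described in [7] (MPI_Send takes a destination, tag, communicator and variable; MPI_Recv returns the received variable), and exact argument forms may differ between releases.

    MPI_Init;                          % start the message passing environment
    comm = MPI_COMM_WORLD;             % default communicator
    my_rank = MPI_Comm_rank(comm);     % rank of this interpreter instance
    Ncpus = MPI_Comm_size(comm);       % total number of interpreter instances
    tag = 1;
    if my_rank == 0
        data = rand(1024);             % a native MATLAB matrix
        MPI_Send(1, tag, comm, data);  % send the matrix to rank 1
    elseif my_rank == 1
        data = MPI_Recv(0, tag, comm); % receive the matrix from rank 0
    end
    MPI_Finalize;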
3.2 HPCS Findings on CSIDEs
Software engineering studies from the DARPA HPCS program (http://www.highproductivity.org/) point to the need for a CSIDE that provides access to HPC resources. Carver et al. [4] describe a set of observations about HPC software development teams, including: (1) development of science and code portability are of primary concern, (2) high turnover in development teams and (3) iterative, agile development. Traditional CSIDEs provide interactive access, run-time data inspection and a higher-level command language (similar to a scripting language), which is typically easier for new developers to understand.
Wolter et al. [14] found that time to solution is a limiting factor for HPC software development. Use of a CSIDE often reduces the time to obtain the first solution when compared to traditional FORTRAN or C programming. Wolter et al. [14] also found that steep learning curves associated with parallel tools inhibit their use by computational scientists. This is reminiscent of arguments supporting CSIDEs in sequential environments [12], namely, that computational scientists prefer an integrated environment with previously designed, well-understood interactions between tools.
Funk et al. [6] describe the relative development time productivity metric (RDTP). The RDTP metric is applied to various implementations of the HPC Challenge benchmarks [9]. A CSIDE (MATLAB extended with pMatlab) implementation scored higher than C+MPI for three of the four benchmarks. These data indicate that a CSIDE can improve parallel programmer productivity.
4 ParaM Design
4.1 Overview and Status
Research scientists at OSC had been using MatlabMPI [7] for parallel interpretation of MATLAB scripts. MatlabMPI's application programming interface (API) met the needs of these developers. However, the MatlabMPI package was written to be portable across machine environments from standalone desktop PCs to clusters. Our goal was to write a package that preserved the MatlabMPI API, but was tailored to a large, centrally administered, shared cluster at a major computer center. We sought to provide efficient, scalable communications to handle large problems; a simple, reliable job launching process; and to preserve portability of parallel MATLAB scripts. Maintaining portability included porting between interpreters (MATLAB and GNU Octave), between MatlabMPI and ParaM, and between computer centers with different batch environments and administrative policies. Additionally, we wanted to support processor architectures not supported by MATLAB (e.g., Itanium) and to leverage high-speed interconnects (e.g., Myrinet and InfiniBand).
A development team at OSC developed ParaM to address these goals. ParaM is a software distribution for parallel interpretation of explicitly parallel MATLAB scripts. ParaM extends the internally developed bcMPI package with a number of externally developed software packages. In order to support the MatlabMPI API, ParaM includes the MatlabMPI API as well as forward and backward compatibility
functions to allow application code to be moved between desktop and HPC environments with minimal modifications. ParaM uses OpenMPI or MPICH for its transport layer and transparently makes use of those libraries' support for Myrinet and InfiniBand to provide efficient communication. In order to provide simple, reliable job launching, ParaM includes a batch script generator that supports the OpenPBS and LSF batch systems. ParaM is capable of using GNU Octave to support alternative processor architectures. In addition, we developed an automated installer to ease installation. Note that ParaM does not yet include full CSIDE functionality (for example, graphing and visualization support are not included), but could be extended to provide such functions.
4.2 bcMPI
bcMPI is a software library that implements a bridge design pattern between an interpreter and an MPI library (both the interpreter and the choice of MPI library are independently interchangeable). Design goals included support for (1) either the MATLAB or Octave interpreter, (2) different MPI implementations and (3) existing APIs for message passing, e.g., the MatlabMPI [7] API and MATLAB's Distributed Computing Toolbox (DCT) message passing API. bcMPI's software organization is presented in Fig. 1. The bcMPI library consists of three layers: an MPI API for parallel MATLAB scripts, interpreter-to-C bindings and a core library written in C. The MPI API interpreter layer is a collection of interpreter scripts directly called from the user's parallel MATLAB script. In addition, the pMatlab PGAS library (see Sect. 4.3) has also been ported into the ParaM framework so that parallel MATLAB scripts that use pMatlab call the MPI API. Each MPI API call in turn calls a binding that is an interpreter-specific external interface function (MEX functions for MATLAB and OCT functions for Octave). The bindings provide direct access to interpreter data structures to avoid CPU and memory costs for serialization. The bcMPI core library is a C library supporting a collection of data types (matrices and structures) that correspond to MATLAB and Octave data types.
Fig. 1 bcMPI software architecture
The core library contains functions to serialize those data types to MPI data types and to exchange their values via MPI. The core library uses the MPI library as a low-latency, high-bandwidth transport layer. bcMPI can be used to execute MatlabMPI scripts on a cluster in a large shared computer center environment. A developer can use MatlabMPI on a multicore PC or departmental cluster to interactively develop and debug their parallel interpreter scripts. Using bcMPI, the developer can then run these scripts on a large shared cluster with a high-performance interconnect. In this way, the investment in code development time is preserved. There are differences between the two APIs, e.g., MatlabMPI allows for alphanumeric MPI tags while bcMPI requires numeric tags. There are also differences in operational semantics, e.g., MPI_Send is a blocking send by default in bcMPI while nonblocking in MatlabMPI. In addition, bcMPI provides native MPI support for communicators, barriers, broadcasts and collective operations.
The layered design of bcMPI provided flexibility in adding features. For example, bcMPI's MPI_Send function was initially mapped directly to the MPI_Send call in the MPI library. This resulted in an MPI_Send call that blocked, while MPI_Send in MatlabMPI performs in a nonblocking manner. We wanted to provide users the option of using blocking sends for efficiency if desired, but provide nonblocking sends for compatibility with existing MatlabMPI scripts. The MPI_Buffer_attach function was added to bcMPI. bcMPI's MPI_Buffer_attach causes the core library to allocate a buffer for the MPI library to use. The MPI_Send function in the core library was updated to check for the existence of an allocated buffer and, if found, call a non-blocking MPI_Bsend in the MPI library. Parallel MATLAB scripts that require nonblocking sends need only insert an MPI_Buffer_attach call at the beginning of the script and MPI_Send calls will not block, provided the buffer is of sufficient size.
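As a sketch of this compatibility mechanism, the following fragment assumes the MatlabMPI-style API and a hypothetical buffer size argument in bytes for MPI_Buffer_attach (the actual argument form should be checked against the bcMPI documentation). Without the MPI_Buffer_attach call, both ranks would enter a blocking MPI_Send and the exchange could deadlock; with it, the sends are buffered and return immediately.

    MPI_Init;
    comm = MPI_COMM_WORLD;
    my_rank = MPI_Comm_rank(comm);
    MPI_Buffer_attach(16*1024*1024);   % hypothetical: reserve a 16 MB send buffer
    tag = 99;
    other = 1 - my_rank;               % assumes exactly two ranks, 0 and 1
    A = rand(1000);
    MPI_Send(other, tag, comm, A);     % copied into the buffer; returns immediately
    B = MPI_Recv(other, tag, comm);    % safe: the send above did not block
    MPI_Finalize;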
Utilizing traditional MPI libraries such as MPICH and OpenMPI requires experience with process launchers such as mpiexec as well as HPC queuing systems such as PBS or LSF. bcMPI includes a batch system utility customized at installation time for the given MPI library and HPC system. The batch system utility was designed to abstract away the implementation details of a given HPC system and present a consistent interface to the end user.
4.3 pMatlab/bcMPI
The pMatlab library [3] supports partitioned global address space (PGAS) programming for parallel MATLAB. pMatlab uses the MatlabMPI library for interprocessor communication. In order to improve the performance of pMatlab applications, bcMPI is used in place of MatlabMPI. Informally, this arrangement is known as pMatlab over bcMPI and abbreviated as pMatlab/bcMPI. Note that ParaM provides pMatlab/bcMPI in addition to bcMPI, so users can choose between (or combine) message passing and PGAS programming models. Also, the entire pMatlab API is preserved so that users may develop their scripts on PCs with the standalone pMatlab distribution, and then run them at a large shared computer center using ParaM.
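A minimal pMatlab-style fragment, using the PGAS constructs that also appear in Fig. 6 (map, distributed zeros, local, put_local and agg), illustrates the programming model. The constructor arguments follow the usage shown in that figure and may differ in detail across pMatlab versions; Ncpus and n are assumed to be defined by the launch environment.

    pmap = map([1 Ncpus], {}, 0:Ncpus-1);   % distribute columns across Ncpus ranks
    A = zeros(n, n, pmap);                  % distributed n-by-n matrix
    Alocal = local(A);                      % this rank's contiguous block of columns
    Alocal = fft(Alocal, [], 1);            % each rank operates on its local part only
    A = put_local(A, Alocal);               % write the block back into the global array
    A = agg(A);                             % aggregate the full matrix on the leader rank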
4.4 ParaM Performance Evaluation
We wanted to evaluate the communication behavior of ParaM programs on a large shared cluster. C/MPI programs are the typical workload for such environments and so were selected for a baseline comparison. In order to compare the latency and bandwidth characteristics of ParaM relative to a traditional C/MPI program, ParaM equivalents of common MPI benchmarks were created (based on http://mvapich.cse.ohio-state.edu/benchmarks/). Latency was tested in a ping-pong fashion. Bandwidth was tested by having the sender transmit a fixed number of back-to-back fixed-size messages and then receive a single acknowledgement. The original C benchmarks and the ParaM benchmarks were run on an AMD Opteron cluster at the Ohio Supercomputer Center with nodes containing dual-socket, dual-core 2.6 GHz Opteron processors, 8 GB RAM and an InfiniBand interconnection network. Experiments were run with two nodes, each with exactly one active processor core per node. Experiments were conducted with message sizes ranging from 8 B to 8 MB. Results are presented in Figs. 2 and 3.
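The ping-pong measurement can be sketched as follows, assuming the MatlabMPI-style API used throughout ParaM. The iteration count and message construction are illustrative rather than the exact benchmark code; my_rank, comm and msgBytes are assumed to be set up as in the earlier examples.

    nIters = 100;                        % repetitions to average over
    msg = zeros(1, msgBytes/8);          % message payload (msgBytes a multiple of 8)
    tic;
    for k = 1:nIters
        if my_rank == 0
            MPI_Send(1, k, comm, msg);   % ping
            msg = MPI_Recv(1, k, comm);  % pong
        else
            msg = MPI_Recv(0, k, comm);
            MPI_Send(0, k, comm, msg);
        end
    end
    latency = toc / (2*nIters);          % one-way time per message, in seconds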
Communication overhead can be broadly categorized as either per-message overhead (incurred on each message regardless of its length) or per-byte overhead (i.e., proportional to the size of the message being transmitted). In the case of ParaM, per-message overhead includes using an interpreter (MATLAB or GNU Octave) while per-byte overhead includes any buffering or copying of the message (either at the MATLAB layer, the bcMPI layer or the MPI layer). By increasing the size of the message used in the test, we gain insight into the impact of per-byte overhead relative to per-message overhead, as tests with large messages amortize the per-message overhead across more data and, therefore, are more reflective of per-byte overhead. For small messages, both latency and bandwidth are an order of magnitude worse for ParaM than for the C/MPI program (for 8 B messages, ParaM's latency is 10 times higher and its bandwidth is 22 times lower), showing a high per-message overhead for ParaM.
Fig. 2 Bandwidth benchmark results for bcMPI and MPI
Fig. 3 Latency benchmark results for bcMPI and MPI
For large messages, the latency and bandwidth are within the same order of magnitude (for 8 MB messages, ParaM's latency is 1.8 times higher and its bandwidth is 40% of the C/MPI bandwidth).
Choice of communication method also impacts the performance of ParaM programs. For example, the ParaM benchmarks used in the above comparison relied on the MatlabMPI-compatible version of MPI_Send. In this version, the benchmarks used the bcMPI MPI_Buffer_attach command and all MPI_Send commands were executed by bcMPI as calls to MPI_Bsend in the MPI library. In order to test the overhead of these buffered sends, we modified the ParaM latency and bandwidth benchmarks to use blocking sends that require no buffering. For small messages (8 B), the two implementations (buffered versus blocking) were comparable. However, for large messages (8 MB), the latency was similar (blocking latency was 12% lower) while bandwidth was significantly improved (blocking bandwidth was 55% higher). In summary, for 8 MB messages, the C/MPI bandwidth benchmark reported 930 MB/s, the (original) buffered send ParaM benchmark reported 360 MB/s and the (improved) blocking send ParaM benchmark reported 558 MB/s, which is 60% of the C/MPI bandwidth. This demonstrates that significant bandwidth can be gained by using blocking sends and eliminating buffering. While it is counter to ParaM's goals to require MATLAB programmers to handle the additional complexity of blocking MPI_Send semantics, the bandwidth improvements should warrant attention by library and toolbox developers.
Another area where communication method has a significant impact is collective operations such as all-to-all and broadcast. In order to test the impact of communication method on all-to-all communication, we implemented one version of all-to-all using buffered sends and another version using blocking sends. When executing on N nodes, the buffered send implementation had each node perform N-1 MPI_Send commands followed by N-1 MPI_Recv commands. The blocking send implementation is a linear time communication algorithm in which the sending node is serialized
(the node with MPI rank 0 does N-1 MPI_Sends, followed by rank 1, etc.). There are more efficient all-to-all algorithms, but we selected this algorithm since it is simple and applicable to all node counts (not just powers of two). The buffered send implementation incurs the overhead of copying data, but overlaps communication between nodes. The blocking send implementation incurs the overhead of additional synchronization (only one message is transmitted at a time), but eliminates the additional copy.
Fig. 4 All-to-all benchmark results
We wrote a benchmark to calculate the sustained all-to-all rate (i.e., the number of all-to-all operations performed per second). The benchmark was run on the Opteron cluster described above for varying numbers of nodes (one active processor core per node) and varying message sizes. The all-to-all rates were collected for both the blocking and buffered implementations. In order to compare the relative impact of copying overhead versus synchronization overhead, the performance of the two methods (blocking relative to buffered) is presented in Fig. 4. Note that for relatively small messages (4 KB) the use of buffered sends increases all-to-all throughput and that the advantage of buffered sends increases as the number of nodes scales (for 4 KB messages on 32 nodes, the blocking send implementation sustains 40% of the buffered send rate). For large messages (1 MB), all-to-all throughput increases with the use of blocking sends. This shows that all-to-all throughput is dependent upon a variety of factors in the system and no one algorithm consistently produces better results. All-to-all communication is therefore a candidate for case-by-case optimization, where the buffered implementation is used for messages below a certain size and the blocking implementation for larger messages.
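The two exchange patterns compared in Fig. 4 can be sketched as follows (MatlabMPI-style API assumed; mydata, tag, comm, my_rank and Ncpus are assumed to be defined, and the buffered variant additionally assumes a prior MPI_Buffer_attach so that its sends return immediately).

    % Buffered variant: post all N-1 sends, then all N-1 receives.
    for dst = 0:Ncpus-1
        if dst ~= my_rank, MPI_Send(dst, tag, comm, mydata); end
    end
    for src = 0:Ncpus-1
        if src ~= my_rank, recvd{src+1} = MPI_Recv(src, tag, comm); end
    end

    % Blocking variant: ranks take turns sending, so only one message
    % is in flight at a time and no buffering is required.
    for sender = 0:Ncpus-1
        if my_rank == sender
            for dst = 0:Ncpus-1
                if dst ~= my_rank, MPI_Send(dst, tag, comm, mydata); end
            end
        else
            recvd{sender+1} = MPI_Recv(sender, tag, comm);
        end
    end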
Finally, we wanted to evaluate the impact of moving communication algorithms out of the MATLAB layer and into the MPI library. We compared the implementation of the MPI_Bcast function provided in the MatlabMPI library with bcMPI's MPI_Broadcast function. MatlabMPI's MPI_Bcast maps to a series of MPI_Send and MPI_Recv calls in the MPI layer, while the MPI_Broadcast function results in a call to MPI_Bcast in the MPI layer. We tested broadcast rate (number of broadcasts per
second) in a manner similar to [8]. The tests were run on the Opteron cluster described above for varying numbers of nodes (one active processor core per node) and varying broadcast message sizes. In order to compare the relative performance of broadcast in the MPI layer versus broadcast in the MATLAB layer, results are presented in Fig. 5. For the smallest message sizes and smallest numbers of nodes, there is a marginal advantage (approximately 10%) to the MATLAB implementation. However, for any combination of larger message sizes or larger node counts, there is a clear advantage in moving communication activity into the MPI layer.
Fig. 5 Broadcast benchmark results
5 Application Development
5.1 Synthetic Aperture Radar (SAR) Image Processing
The Third Scalable Synthetic Compact Application (SSCA #3) benchmark [2], from the DARPA HPCS Program, performs Synthetic Aperture Radar (SAR) processing. SAR processing creates a composite image of the ground from signals generated by a moving airborne radar platform. It is a computationally intense process, requiring image processing and file I/O. In order to parallelize SSCA #3, the MATLAB profiler was run on the serial implementation. The profiler showed that approximately 67.5% of the time required for computation is spent in the image formation function of Kernel 1 (K1). Parallelization techniques were then applied to the function formImage in K1. Within formImage, the function genSARimage is responsible for the computationally intense task of creating the SAR image. genSARimage consists of two parts, namely, the interpolation loop and the 2D inverse Fourier transform. Both of these parts were parallelized through the creation of distributed matrices and then executed via pMatlab/bcMPI.
The source lines of code (SLOC) count is a common metric of code size, which is related to code complexity. The SLOC count of the serial MATLAB implementation
is 2,701 lines, while the SLOC count of the pMatlab implementation is 2,852 lines, thus introducing a 5.5% overhead in code complexity to parallelize 67.5% of the execution time of the application. A code example is presented in Fig. 6 showing the serial and parallel versions of one code segment from the function genSARimage. Figure 6 shows standard MATLAB code using matrix operators and arguments (for example, the zeros function is a matrix constructor which returns a matrix of all zeros, and .* denotes elementwise multiplication) and higher-level functions (e.g., ifft performs an inverse fast Fourier transform). In the sequential version, the matrix F needs to be divided among the processor cores to parallelize the computation. The opening lines of the parallel version create a distributed version of the matrix, pF, which is distributed as contiguous blocks of columns across all processors. The code within the loop remains functionally equivalent, with the parallel version altered so that each processor core processes its local part of the global array. After the loop, a pMatlab transpose_grid [3] operation performs all-to-all communication to change pF from a column-distributed matrix to a row-distributed matrix in order to compute the following inverse FFT in parallel. Finally, the entire pF array is aggregated back on a single processor using the pMatlab agg command.
The ParaM implementation of the SSCA #3 benchmark was run on an AMD Opteron cluster at the Ohio Supercomputer Center with nodes containing dual 2.2 GHz Opteron processors, 4 GB RAM and an InfiniBand interconnection network. A matrix size of 1492 × 2296 elements was chosen, and the runs were conducted on 1, 2, 4, 8, 16, 24, and 32 processor cores over the InfiniBand interconnection network. Note that the single-core time was generated using the original serial MATLAB code. The absolute performance time and observed speedup for image formation are given in Table 1.
Amdahl's law [1] states that the maximum speedup of a parallel application is inversely proportional to the fraction of time spent in sequential execution. In our parallelization of SSCA #3, genSARimage, which accounted for 67.5% of overall execution time, was parallelized. The remaining execution time (32.5%) remains serial and, therefore, the theoretical speedup on p cores is 1/(0.325 + (0.675/p)). The maximum speedup is 1/0.325 ≈ 3.0. The results in Table 1 show a close correlation between theoretical and observed speedup, as well as a maximum observed speedup of 2.62, which is approaching the maximum theoretical speedup.
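The theoretical speedup column of Table 1 follows directly from this model; a short check in plain MATLAB, with the serial fraction taken from the profile above:

    s = 0.325;                         % measured serial fraction
    p = [1 2 4 8 16 24 32];            % processor core counts used in Table 1
    speedup = 1 ./ (s + (1 - s) ./ p); % Amdahl-style model
    % yields approximately 1.00 1.51 2.03 2.44 2.72 2.83 2.89,
    % matching the theoretical speedup column of Table 1.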
5.2 Acoustic Signal Processing
Researchers at the Ohio Supercomputer Center are using ParaM for a number of real-world applications. For example, acoustic signal processing on a battlefield primarily involves detection and classification of ground vehicles, as depicted in Fig. 7. By using an array of active sensors, signatures of passing objects can be collected for target detection, tracking, localization and identification. One of the major components of using such sensor networks is the ability of the sensors to perform self-localization. Self-localization can be affected by environmental characteristics such as the terrain, wind speed, etc.
%%%%%%%%%% SERIAL CODE %%%%%%%%%%
F = single(zeros(nx, m));
for ii = 1:n
    icKX = round((kx(ii,:) - KX(1,1))/dkx) + 1;
    ikx = ones(I,1)*icKX + [-nInterpSidelobes:nInterpSidelobes]'*ones(1,m);
    ikx = ikx + nx*ones(I,1)*[0:m-1];
    nKX = KX(ikx);
    SINC = sinc((nKX - ones(I,1)*kx(ii,:))/dkx);
    HAM = 0.54 + 0.46*cos((pi/kxs)*(nKX - ones(I,1)*kx(ii,:)));
    F(ikx) = F(ikx) + (single(ones(I,1))*fsm(ii,:)) .* (SINC .* HAM);
end
%%% 2D IFFT %%%
spatial = fftshift(ifft(ifft(fftshift(F),[],2)));

%%%%%%%%%% PARALLEL CODE %%%%%%%%%%
pFmap = map([1 Ncpus], {}, [0:Ncpus-1]);
pF = zeros(nx, m, pFmap);
pFlocal = local(pF);
pFlocal_size = size(pFlocal);
kxlocal  = kx(:,  (my_rank*pFlocal_size(2)+1):((my_rank+1)*pFlocal_size(2)));
KXlocal  = KX(:,  (my_rank*pFlocal_size(2)+1):((my_rank+1)*pFlocal_size(2)));
fsmlocal = fsm(:, (my_rank*pFlocal_size(2)+1):((my_rank+1)*pFlocal_size(2)));
m = length((my_rank*pFlocal_size(2)+1):((my_rank+1)*pFlocal_size(2)));
for ii = 1:n
    icKX = round((kxlocal(ii,:) - KXlocal(1,1))/dkx) + 1;
    ikxlocal = ones(I,1)*icKX + [-nInterpSidelobes:nInterpSidelobes]'*ones(1,m);
    ikxlocal = ikxlocal + nx*ones(I,1)*[0:m-1];
    nKX = KXlocal(ikxlocal);
    SINC = sinc((nKX - ones(I,1)*kxlocal(ii,:))/dkx);
    HAM = 0.54 + 0.46*cos((pi/kxs)*(nKX - ones(I,1)*kxlocal(ii,:)));
    pFlocal(ikxlocal) = pFlocal(ikxlocal) + (single(ones(I,1))*fsmlocal(ii,:)) .* (SINC .* HAM);
end
%%% 2D IFFT %%%
pFlocal = ifft(pFlocal, [], 2);
pF = put_local(pF, pFlocal);
Z = transpose_grid(pF);
clear pF, pFlocal;
Zlocal = local(Z);
Zlocal = ifft(Zlocal, [], 1);
Z = put_local(Z, Zlocal);
Z = agg(Z);
spatial = (Z);

Fig. 6 pMatlab parallelization of image formation (serial version, top; pMatlab parallel version, bottom)
Table 1 SSCA #3 results

Processor cores   Data per core   Compute time (s)   Theoretical speedup   Observed speedup   Fraction of maximum speedup
 1                1492 × 2296     143.0              1.00                  1.00               N/A
 2                1492 × 1148     102.0              1.51                  1.40               46%
 4                1492 × 574       72.3              2.03                  1.98               66%
 8                1492 × 287       60.8              2.44                  2.35               78%
16                1492 × 143       57.4              2.72                  2.49               83%
24                1492 × 96        54.8              2.83                  2.61               87%
32                1492 × 72        54.5              2.89                  2.62               87%
Fig. 7 Vehicle signature identified via signal processing
A number of different algorithms are used to estimate the time of arrival of the acoustic signals. If the number of analysis algorithms applied to the data is increased, the processing time increases correspondingly. In order to achieve improved response time, the data was processed in parallel. Since each data file can be processed independently of the others, the initial parallelization approach was to split up processing of individual data files across multiple processors.
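A minimal sketch of this task-parallel distribution (each rank processes every Ncpus-th file) is shown below; process_audio_file and dataDir are hypothetical placeholders for the application's analysis routines and data location, and the WAV extension is assumed for illustration only.

    files = dir(fullfile(dataDir, '*.wav'));   % list the recorded audio files
    results = cell(1, numel(files));
    for k = (my_rank+1):Ncpus:numel(files)     % round-robin assignment of files to ranks
        results{k} = process_audio_file(fullfile(dataDir, files(k).name));
    end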
In recent experiments, several gigabytes of data consisting of multiple audio files, each approximately 3 minutes long, were collected. Using the AMD Opteron cluster described in Sect. 5.1, ParaM was used to process a sample data set of 63 audio files (180 MB of data in total). The application consists of 50 MATLAB files with a total SLOC count of 2,835. The source modifications required to parallelize the application under ParaM consisted of 25 lines of code (a SLOC increase of less than 1%). Performance results are included in Table 2, showing near-linear speedup up to 32 cores, where the sequential processing overhead of calculating summary statistics over all images becomes a significant component of total execution time.
The initial parallelization strategy is a task-parallel solution that incurs no interprocessor communication (i.e., it is embarrassingly parallel). Such strategies are a valuable first step for exploiting clusters, and ParaM supports this approach.
Table 2 Acoustic signal processing results

Cores   Run time (s)   Speedup
 1      24654           1.0
 2      12618           1.9
 4       6390           3.8
 8       3258           7.5
16       1650          14.9
32        934          26.3
64        798          30.8
This shows the value of ParaM beyond just bcMPI: by using ParaM's job control utilities, the application designers were able to rapidly port their sequential MATLAB application to a production cluster. The scalability of the embarrassingly parallel approach is limited, since the number of input files (63 in this case) is an upper bound on the number of processors. In the next version, the developers plan to parallelize the individual signal processing routines in order to scale to larger numbers of processors. The developers can then use bcMPI to leverage the cluster's interconnection network.
6 Conclusions and Future Work
An HPC CSIDE provides an integrated, high-level interface for data-parallel mathematical operations. ParaM is a software distribution that provides a computational scientist with the ability to run explicitly parallel MATLAB scripts on large shared clusters. ParaM's message passing library, bcMPI, has a modular design that allows support for multiple interpreters (MATLAB and GNU Octave) and multiple MPI libraries (OpenMPI and MPICH) across a range of HPC systems. For large messages, bcMPI benchmarks obtain sixty percent of the bandwidth of similar C/MPI benchmarks. In addition, experiments show that ParaM applications are sensitive to synchronization and buffering overheads and benefit greatly from leveraging MPI-layer functionality, such as MPI_Bcast. Finally, we highlighted two applications ported to clusters using ParaM to demonstrate how application developers are using HPC CSIDEs.
There are multiple avenues for future work on ParaM. Communication algorithms can be selected based on problem parameters, such as the automatic selection of buffered all-to-all communication for small message sizes. The ability to enable advanced MPI features, such as one-sided operations, shows promise for improving the performance of pMatlab/bcMPI applications. Also, in order for computational scientists to develop parallel scripts that scale to large numbers of processors, user-friendly mechanisms for performance feedback must be provided.
Acknowledgements Thanks to Jeremy Kepner and the staff at MIT Lincoln Labs for providing a pMATLAB implementation of the HPC Challenge benchmarks. Thanks to Gene Whipps at the U.S. Army Research Laboratory and Randy Moses from The Ohio State University for the description of acoustic signature analysis.
References
1. Amdahl, G.: Validity of the single processor approach to achieving large-scale computing capabilities. In: AFIPS Conference Proceedings, vol. 30, pp. 483-485 (1967)
2. Bader, D., Madduri, K., Gilbert, J., Shah, V., Kepner, J., Meuse, T., Krishnamurthy, A.: Designing scalable synthetic compact applications for benchmarking high productivity computing systems. CTWatch Quarterly, vol. 2, no. 4B, November 2006
3. Bliss, N., Kepner, J.: pMatlab Parallel Matlab Library. In: Kepner, J., Zima, H. (eds.) International Journal of High Performance Computing Applications: special Issue on High Level Programming Languages and Models, Winter 2006 (November)
4. Carver, J.C., Hochstein, L.M., Kendall, R.P., Nakamura, T., Zelkowitz, M.V., Basili, V.R., Post, D.E.: Observations about software development for high end computing. CTWatch Quarterly, vol. 2, no. 4A, November 2006
5. Edelman, A.: Parallel MATLAB Survey. http://www.interactivesupercomputing.com/reference/ParallelMatlabsurvey.htm
6. Funk, A., Basili, V., Hochstein, L., Kepner, J.: Analysis of parallel software development using the relative development time productivity metric. CTWatch Quarterly, vol. 2, no. 4A, November 2006
7. Kepner, J., Ahalt, S.: MatlabMPI. J. Parallel Distrib. Comput. 64(8), 997-1005 (2004). doi:10.1016/j.jpdc.2004.03.018
8. Liu, J., Mamidala, A., Panda, D.K.: Fast and scalable MPI-level broadcast using InfiniBand's hardware multicast support. In: Int'l Parallel and Distributed Processing Symposium (IPDPS '04), April 2004
9. Luszczek, P., Dongarra, J., Kepner, J.: Design and implementation of the HPC challenge benchmark suite. CTWatch Quarterly, vol. 2, no. 4A, November 2006
10. Numrich, R., Reid, J.: Co-Array Fortran for parallel programming. ACM Fortran Forum 17(2), 1-31 (1998)
11. UPC language specifications, v1.2. Technical Report LBNL-59208, Berkeley National Lab (2005)
12. Webb, P.: Response to Wilson: teach science and software engineering with Matlab. IEEE Comput. Sci. Eng. 4(2), 4-5 (1997). doi:10.1109/MCSE.1997.609824
13. Wilson, G.: What should computer scientists teach to physical scientists and engineers? IEEE Comput. Sci. Eng. 3(2), 46-55 (1996). doi:10.1109/99.503313
14. Wolter, N., McCracken, M.O., Snavely, A., Hochstein, L., Nakamura, T., Basili, V.: What's working in HPC: investigating HPC user behavior and productivity. CTWatch Quarterly, vol. 2, no. 4A, November 2006
15. Yelick, K., Hilfinger, P., Graham, S., Bonachea, D., Su, J., Kamil, A., et al.: Parallel languages and compilers: perspective from the Titanium experience. Int. J. High Perform. Comput. Appl. 21(2) (2007)