Non-blocking MPI Collective Operations
Non-blocking collective operations (short: NBC) are an
extension to the MPI-2 standard to combine the benefits of non-blocking
point-to-point operations with the advantages of collective
communication.
Non-blocking point-to-point operations allow overlapping of
communication and computation to use the parallelism in modern
computer systems more efficiently. This enables the user to keep the
CPU busy even while message transmissions are ongoing at the network level.
Collective operations allow the user to simplify their code and to use
well-tested and highly optimized routines for common collective
communication patterns. These collective communication patterns can be
tuned to the underlying hardware and yield much better
performance than naive implementations. Unfortunately, all these
operations are defined only in a blocking manner, which rules out
explicit overlap.
Our new proposal combines the advantages of non-blocking point-to-point
and collective communication to enable non-blocking collective
communication, which allows optimized collective communication
patterns to be overlapped with computation.
Why should I use them?
Communication can consume a large part of the running time of a
parallel application. The communication time in those applications can
usually be regarded as overhead because, in most cases (with the
exception of reduce operations), it does not progress the solution of
the problem. Overlapping techniques enable the user to move the
communication and the necessary synchronization into the background
and to use parts of the original communication time to perform useful
computation.
However, the application code and the algorithm have to support
overlapping. The user must find parts of the calculation that do not
depend on the data being transmitted and can therefore be performed
during the message transmission.
How can I use them?
There are many ways to use non-blocking collective operations. They are similar to techniques known from non-blocking point-to-point communication. However, whether nbcolls can decrease the running time of a parallel program depends on the problem and the implementation. Some common scenarios are explained below.
Independent Code
If you are able to find code regions to overlap, you are lucky. You just start the non-blocking collective operation before that region and wait for its completion before you enter a critical region that needs the communicated data. It is all about data dependencies. An example of independent code, a 3D CG Poisson solver, is available at CG Solver.
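The independent-code pattern can be sketched as follows. The call names (NBC_Ibcast, NBC_Wait, NBC_Handle) follow LibNBC's interface as described in [3]; the helper functions and exact signatures are illustrative assumptions, not part of the library.

```c
/* Sketch of overlapping a non-blocking broadcast with independent work.
 * Assumes LibNBC's API per [3]; helpers are hypothetical placeholders. */
#include <mpi.h>
#include <nbc.h>

extern void compute_independent_part(void);      /* does not touch 'data' */
extern void compute_dependent_part(double *d);   /* needs the broadcast data */

void independent_overlap(double *data, int count, MPI_Comm comm) {
  NBC_Handle handle;

  /* start the non-blocking broadcast (root 0) */
  NBC_Ibcast(data, count, MPI_DOUBLE, 0, comm, &handle);

  /* overlap: computation that has no data dependency on 'data' */
  compute_independent_part();

  /* complete the collective before entering the dependent region */
  NBC_Wait(&handle);

  compute_dependent_part(data);
}
```

The only rule is the data dependency: nothing between the start and the wait may read or write the communication buffer.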
Loops
Many loops can be optimized with a principle called double buffering with look-ahead. This is comparable to a pipelined execution. You communicate the first element before the first loop iteration starts. Each loop iteration i communicates the data for iteration i+1. The communication and calculation buffers are swapped at the end of each iteration.
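A minimal sketch of the double-buffering scheme, again using LibNBC-style names per [3] (the compute routine and the choice of broadcast as the collective are illustrative assumptions):

```c
/* Sketch of double buffering with look-ahead: iteration i's computation
 * overlaps with the transfer of iteration i+1's data.
 * Assumes LibNBC's API per [3]; compute() is a hypothetical placeholder. */
#include <mpi.h>
#include <nbc.h>

extern void compute(double *buf, int count);

void pipelined_loop(double *buf[2], int count, int rounds, MPI_Comm comm) {
  NBC_Handle handle;
  int cur = 0;

  /* communicate the data for the first iteration up front */
  NBC_Ibcast(buf[cur], count, MPI_DOUBLE, 0, comm, &handle);

  for (int i = 0; i < rounds; ++i) {
    int next = 1 - cur;
    NBC_Wait(&handle);                /* data for iteration i is ready */
    if (i + 1 < rounds)               /* look-ahead: fetch data for i+1 */
      NBC_Ibcast(buf[next], count, MPI_DOUBLE, 0, comm, &handle);
    compute(buf[cur], count);         /* overlaps with the i+1 transfer */
    cur = next;                       /* swap the two buffers */
  }
}
```

The swap at the end of each iteration is what keeps the communication buffer and the computation buffer disjoint, so the transfer and the calculation never race on the same memory.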
Other Approaches
It can be a hard task to apply non-blocking collective communication to algorithms where the two principles mentioned above cannot be applied. Those algorithms usually require a more sophisticated scheme that is highly dependent on the computational problem. General guidelines for this kind of problem are to start the communication as early as possible and to end it only directly before the data is needed.
Where can I get it?
A first prototype implementation focuses on portability and is based on MPI-1, so it runs on every parallel computer that supports MPI-1. It uses optimized point-to-point algorithms. It also runs in non-threaded environments, which requires the user to progress the operations by calling a test function. The prototype of LibNBC is available as source code at LibNBC. A description of LibNBC is available in [3].
Are there any Examples available?
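In such a non-threaded environment, progression can be driven by interleaving chunks of computation with calls to the test function. The names (NBC_Test, NBC_Wait) follow LibNBC per [3]; the work-loop helpers and the exact return convention of NBC_Test are assumptions for illustration.

```c
/* Sketch of manual progression in a non-threaded environment:
 * periodic NBC_Test calls give LibNBC a chance to advance the schedule.
 * Assumes LibNBC's API per [3]; work helpers are hypothetical. */
#include <mpi.h>
#include <nbc.h>

extern int  more_work_available(void);
extern void do_some_work(void);   /* a bounded chunk of computation */

void progress_loop(double *data, int count, MPI_Comm comm) {
  NBC_Handle handle;

  NBC_Ibcast(data, count, MPI_DOUBLE, 0, comm, &handle);

  while (more_work_available()) {
    do_some_work();
    NBC_Test(&handle);  /* poll: progresses the outstanding collective */
  }
  NBC_Wait(&handle);    /* finish the operation before 'data' is used */
}
```

Calling the test function too rarely delays the internal communication rounds; calling it very often eats into the overlapped computation, so the chunk size of `do_some_work` is a tuning knob.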
A first example, a 3D CG Poisson solver, is available at CG Solver. A detailed description and benchmark results are available in [4].
References
[1] T. Hoefler, A. Lumsdaine, W. Rehm:
Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI.
In Proceedings of the 2007 International Conference on High Performance Computing, Networking, Storage and Analysis (SC07), Reno, USA, IEEE Computer Society/ACM, Nov. 2007 (acceptance rate 20%, 54/268).
[2] T. Hoefler, J. Squyres, G. Bosilca, G. Fagg, A. Lumsdaine, W. Rehm:
Non-Blocking Collective Operations for MPI-2.
Open Systems Lab, Indiana University, Bloomington, IN, USA, School of Informatics, Aug. 2006.
[3] T. Hoefler, A. Lumsdaine:
Design, Implementation, and Usage of LibNBC.
Open Systems Lab, Indiana University, Bloomington, IN, USA, School of Informatics, Aug. 2006.
[4] T. Hoefler, P. Gottschling, A. Lumsdaine, W. Rehm:
Optimizing a Conjugate Gradient Solver with Non-Blocking Collective Operations.
Elsevier Journal of Parallel Computing (PARCO), Vol. 33, No. 9, pp. 624-633, ISSN: 0167-8191, Sep. 2007.