Non-blocking MPI Collective Operations

Non-blocking collective operations (short: NBC) are an extension to the MPI-2 standard that combines the benefits of non-blocking point-to-point operations with the advantages of collective communication.

Non-blocking point-to-point operations allow communication and computation to be overlapped, using the parallelism available in modern computer systems more efficiently. This enables the user to use the CPU even while message transmissions are ongoing at the network level.

Collective operations allow users to simplify their code and to use well-tested and highly optimized routines for common collective communication patterns. These patterns can be tuned to the underlying hardware and yield much better performance than naive implementations. Unfortunately, all these operations are only defined in a blocking manner, which precludes explicit overlap.

Our new proposal combines the advantages of non-blocking point-to-point and collective communication into non-blocking collective communication, which allows optimized collective communication patterns to be overlapped with computation.

The document [2] provides a formal definition and may act as an MPI-2 standard-extension proposal.

Why should I use them?

Communication can consume a large part of the running time of a parallel application. This communication time can be regarded as overhead because, in most cases, it does not progress the solution of the problem (with the exception of reduce operations). Overlapping techniques enable the user to move the communication and the necessary synchronization into the background and to use parts of the original communication time for useful computation.

However, the application code and the algorithm have to support overlapping. The user must find independent parts of the calculation that can be performed during a message transmission and do not depend on the result of that transmission.

How can I use them?

There are many ways to use non-blocking collective operations, and they are somewhat similar to the techniques known from non-blocking point-to-point communication. However, whether non-blocking collectives can decrease the running time of a parallel program depends on the problem and the implementation. Some common scenarios are explained below.

   Independent Code If you are able to find code regions to overlap, you are lucky. You simply start the non-blocking collective operation before that region and wait for its completion before you enter a critical region that needs the communicated data. It's all about data dependencies. An example of independent code, a 3D CG Poisson solver, is available at CG Solver.
Loops Many loops can be optimized with a principle called double buffering with look-ahead, which is comparable to a pipelined execution. You communicate the first element before the first loop iteration starts. Every loop iteration i then communicates the data for the next iteration i+1. The communication and calculation buffers are swapped at the end of each iteration.
Other Approaches It can be a hard task to apply non-blocking collective communication to algorithms where the two principles mentioned above cannot be applied. Those algorithms usually require a more sophisticated scheme that is highly dependent on the computational problem. General guidelines for this kind of problem are to start the communication as early as possible and to complete it only immediately before the data is needed.

Where can I get it?

A first prototype implementation focuses on portability and is built on top of MPI-1, so it runs on every parallel computer that supports MPI-1. It uses optimized point-to-point algorithms. It also runs in non-threaded environments, where the user has to progress pending operations by calling a test function. The prototype, LibNBC, is available as source code at LibNBC. A description of LibNBC is available in [3].

Are there any examples available?

A first example, a 3D CG Poisson solver, is available at CG Solver. A detailed description and benchmark results are available in [4].

References

[1] T. Hoefler, A. Lumsdaine, W. Rehm: Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI. In Proceedings of the 2007 International Conference on High Performance Computing, Networking, Storage and Analysis (SC07), Reno, USA, IEEE Computer Society/ACM, Nov. 2007. (Acceptance rate 20%, 54/268.)
[2] T. Hoefler, J. Squyres, G. Bosilca, G. Fagg, A. Lumsdaine, W. Rehm: Non-Blocking Collective Operations for MPI-2. Technical report, Open Systems Lab, Indiana University, Bloomington, IN, USA, Aug. 2006.
[3] T. Hoefler, A. Lumsdaine: Design, Implementation, and Usage of LibNBC. Technical report, Open Systems Lab, Indiana University, Bloomington, IN, USA, Aug. 2006.
[4] T. Hoefler, P. Gottschling, A. Lumsdaine, W. Rehm: Optimizing a Conjugate Gradient Solver with Non-Blocking Collective Operations. Elsevier Journal of Parallel Computing (PARCO), Vol. 33, No. 9, pp. 624-633, ISSN 0167-8191, Sep. 2007.