MPI Derived Datatype (Benchmark) Page

This page advocates the use of MPI Derived Datatypes (DDTs). DDTs are a powerful mechanism for declaring data access patterns to the MPI library, which can then choose the best method for sending or receiving the data. In principle, this mechanism is superior to manually packing and unpacking the data. However, early implementations were suboptimal (and some still are), so many users assume that DDTs are not useful.

On this page, I present several examples where DDTs improved the performance of parallel codes. The paper also presents negative examples; the benchmarks are available here so that vendors can optimize for the expected cases. Datatypes are also complete in the sense that they can express any arbitrary permutation, so the potential optimization space is huge (combinatorial). However, only some patterns are common to most applications. The application benchmarks on this page aim to provide such patterns to implementers and thus steer optimization in a useful direction.

Datatype Application Benchmarks

This page hosts two benchmarks for MPI datatypes. The first is a simple parallel two-dimensional Fast Fourier Transform (FFT) using FFTW in a 1-d decomposition. The second is a full application code (MIMD Lattice Computation, MILC) acting on a four-dimensional matrix. See the README and LICENSE files in the top directory of both packages for details.

Download


Datatyped Applications

Robert Gerstenberger extended three applications to use datatypes during his stay at NCSA. The results can be found below:

Download
The results have been summarized in the datatype microbenchmark DDTBench and the publication [3].

Semi-Automatic Datatype Generation

Marc Snir and Torsten Hoefler co-advised Fredrik Kjolstad's Master's work on automatic datatype extraction from source code using refactoring techniques. Fredrik's webpage has additional details. Fredrik converted the packing loops of the NAS parallel benchmarks version 3.2 (Fortran) to straightforward C loop code and applied his tool to convert the C loops to MPI datatypes. The patches can be downloaded from his webpage and are mirrored here: nas-datatype-patches.tgz (59.87 kB).
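To illustrate the kind of loop-to-datatype conversion described above, the following C sketch places a hand-written packing loop next to an equivalent strided description. It is neither output of Fredrik's tool nor code from the NAS benchmarks; the array sizes, the vec_desc struct, and all function names are assumptions made for illustration. The three fields of vec_desc correspond to the count, blocklength, and stride arguments of MPI_Type_vector. MPI itself is left out so that the sketch compiles without an MPI installation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sizes of a row-major array a[NX][NY][NZ] (assumption). */
enum { NX = 4, NY = 3, NZ = 5 };

/* Hypothetical descriptor; the fields mirror the MPI_Type_vector arguments. */
typedef struct {
    int count;     /* number of blocks */
    int blocklen;  /* elements per block */
    int stride;    /* elements between the starts of consecutive blocks */
} vec_desc;

/* Hand-written packing loop: copy the plane j == j0 into buf. */
void pack_plane_loop(const double *a, int j0, double *buf)
{
    assert(j0 >= 0 && j0 < NY);
    size_t n = 0;
    for (int i = 0; i < NX; i++)
        for (int k = 0; k < NZ; k++)
            buf[n++] = a[((size_t)i * NY + j0) * NZ + k];
}

/* The same plane as a descriptor: NX blocks of NZ doubles, NY*NZ apart. */
vec_desc plane_desc(void)
{
    vec_desc d = { NX, NZ, NY * NZ };
    return d;
}

/* Reference walk over a descriptor, starting at the plane's first element. */
void pack_from_desc(const vec_desc *d, const double *first, double *buf)
{
    for (int b = 0; b < d->count; b++)
        memcpy(buf + (size_t)b * d->blocklen,
               first + (size_t)b * d->stride,
               (size_t)d->blocklen * sizeof *buf);
}
```

With MPI, the descriptor would presumably be passed on as MPI_Type_vector(d.count, d.blocklen, d.stride, MPI_DOUBLE, &plane) followed by MPI_Type_commit(&plane). The send would then start at a + j0 * NZ with a count of one, and the intermediate buffer would no longer be needed.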

Runtime compilation of pack/unpack functions for MPI DDTs

As shown in [3], MPI DDTs are often outperformed by manual packing. The main reason is that in current MPI implementations DDTs are interpreted at runtime, while the code for manual packing can be optimized/specialized by the compiler at the compile time of the application. We wrote a packing library which closes the performance gap between MPI DDTs and manual packing by utilizing runtime compilation techniques.
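The difference between interpretation and specialization can be sketched in plain C. The example below is not taken from the packing library and performs no runtime compilation; it only contrasts the two code shapes. The layout struct, the function names, and the concrete constants are assumptions chosen for illustration. The specialized function stands in for what a compiler, whether ahead of time or at runtime, could produce once the parameters are known.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical runtime description of a strided layout, in elements. */
typedef struct {
    size_t count;     /* number of blocks */
    size_t blocklen;  /* doubles per block */
    size_t stride;    /* doubles between the starts of consecutive blocks */
} layout;

/* Interpreted packing: every parameter is read from memory at run time,
 * so the compiler cannot unroll the loop or fix the copy size. */
void pack_interpreted(const layout *l, const double *src, double *dst)
{
    assert(l->stride >= l->blocklen);  /* blocks must not overlap */
    for (size_t b = 0; b < l->count; b++)
        memcpy(dst + b * l->blocklen,
               src + b * l->stride,
               l->blocklen * sizeof *dst);
}

/* Specialized packing for one fixed layout. All loop bounds and offsets
 * are compile-time constants, which gives the compiler room to unroll
 * and vectorize. */
enum { COUNT = 64, BLOCKLEN = 2, STRIDE = 128 };

void pack_specialized(const double *src, double *dst)
{
    for (size_t b = 0; b < COUNT; b++)
        for (size_t e = 0; e < BLOCKLEN; e++)
            dst[b * BLOCKLEN + e] = src[b * STRIDE + e];
}
```

Both functions produce the same packed buffer for the layout { COUNT, BLOCKLEN, STRIDE }. Any performance difference between them depends on the compiler and the machine and would have to be measured.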

References

EuroMPI'10
[1] T. Hoefler, S. Gottlieb:
 Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient using MPI Datatypes. In Recent Advances in the Message Passing Interface (EuroMPI'10), presented in Stuttgart, Germany, LNCS Vol. 6305, pages 132-141, Springer, ISSN: 0302-9743, ISBN: 978-3-642-15645-8, Sep. 2010.
EuroMPI'11
[2] W. Gropp, T. Hoefler, R. Thakur, J. Larsson Träff:
 Performance Expectations and Guidelines for MPI Derived Datatypes. In Recent Advances in the Message Passing Interface (EuroMPI'11), presented in Santorini, Greece, Vol. 6960, pages 150-159, Springer, ISBN: 978-3-642-24448-3, Sep. 2011.
EuroMPI'12
[3] T. Schneider, R. Gerstenberger, T. Hoefler:
 Micro-Applications for Communication Data Access Patterns and MPI Datatypes. In Recent Advances in the Message Passing Interface - 19th European MPI Users' Group Meeting (EuroMPI 2012), presented in Vienna, Austria, September 23-26, 2012, Vol. 7490, pages 121-131, Springer, ISBN: 978-3-642-33517-4, Sep. 2012. Invited to a journal special issue on top picks from EuroMPI'12.