libpack: Runtime compiled pack functions

Data that has to be communicated between processes is often non-contiguous in memory. For example, in a simple time-stepping 2D stencil application the global domain is represented by an N x N matrix, which is partitioned into blocks of roughly equal size, one per process. After each time step the data at the borders of the process-local arrays has to be exchanged with the processes operating on neighboring blocks, as shown in the illustration below.

Use case for MPI Derived Datatypes: Array face exchanges

If we assume the process-local blocks are stored in row-major order, the data exchanged in the north-south direction is contiguous in memory, so sending it is trivial with any communication framework. The data exchanged in the east-west direction, however, is not contiguous: to exchange those data items we either have to send many small messages or copy the small fragments into a contiguous temporary buffer before sending (and reverse this process on the receiver side). Sending separate messages is inefficient on many networks, since they only reach their peak bandwidth for large messages.

Therefore many scientific codes pack such small data fragments by iterating over them and copying them explicitly into a temporary buffer before sending. The problem with this approach is that it is not performance portable: while on some systems a single large transfer outperforms many small messages by enough to justify the copying overhead, this is not true for all machines to the same extent. Some modern interconnection networks and network APIs, such as Cray's DMAPP, even offer hardware support for strided data transfers, which manual packing cannot exploit.

The MPI standard offers an interface to express such non-contiguous data access patterns: MPI Derived Datatypes (DDTs). Using DDTs, the programmer can specify any data layout and pass this description to an MPI send or receive function. The data layout in the stencil example can be described with a vector datatype, for example, as shown in the illustration above. Ideally this would lead to performance-portable code, as the MPI library would choose the optimal method of packing and sending the data with the available hardware. Even a "lazy" MPI implementation, which does not exploit such advanced hardware features and just packs the data into an internal temporary buffer, should be no slower than manual packing, while reducing code complexity: the programmer no longer needs to manage the temporary pack/unpack buffers (which can be quite involved if non-blocking sends and receives are used), and the packing code itself can be removed entirely.

Unfortunately, reality is very different! In the illustration below we compare the performance of MPI_Pack to the manual packing code of two important scientific applications (MILC and WRF) and the NAS LU benchmark, using Cray's vendor MPI.

A comparison of manual packing performance with packing of MPI Derived Datatypes

As shown, manual packing is always faster, and the difference in performance is large: up to a factor of 10 for small data. Why is MPI so slow compared with manual packing? The fundamental difference between the two approaches is that (in all MPI implementations we know of) MPI Derived Datatypes are interpreted at runtime, while the manual packing code is transformed into machine code when the application is compiled. At that point many of the values that influence the packing code are known to the compiler, so it can generate very efficient code, e.g., by unrolling small inner loops. The datatype interpreter, on the other hand, has to be completely general. Of course a compiler can perform code specialization when compiling the MPI DDT interpreter (or the programmer can provide different variants by hand), but the number of possible cases forces the compiler / programmer to settle on a handful of variants.

With libpack, we attempt to solve this problem by compiling pack functions for MPI DDTs at runtime using LLVM. MPI DDTs have to be created (the data layout of the datatype has to be described) and then committed to the MPI library before they can be used. We generate machine code for packing and unpacking each datatype in the commit phase; when the datatype is later used for packing or unpacking, we simply invoke the previously generated code through a function pointer. The plot below shows that this approach is indeed effective and brings the performance of the MPI pack functions to the same level as manual packing; often we are even able to outperform manual packing.

A comparison of manual packing performance with packing of MPI Derived Datatypes with Cray MPI and libpack

Our libpack library can be used either directly as a C++ library or through the supplied MPI interposer, which captures all calls to the MPI_Type_create and MPI_Send/Recv functions, invokes the libpack functions as appropriate, and uses the PMPI interface to send/receive the packed data. The libpack library is still under development; the latest version can always be downloaded from GitHub. A tarball is also available here.