libpack: Runtime compiled pack functions
Often, data that has to be communicated between processes is non-contiguous in memory. For example, in a simple time-stepping 2D stencil application, the global domain is represented by a matrix of size N x N, which is partitioned into blocks of roughly equal size, one per process. After each timestep, the data at the borders of the process-local arrays has to be exchanged with the processes that operate on neighboring blocks, as shown in the illustration below.
Use case for MPI Derived Datatypes: Array face exchanges
If we assume the process-local blocks are stored in row-major order, the data exchanged in the north-south direction is contiguous in memory, so sending it is trivial with any communication framework. However, the data exchanged in the east-west direction is not contiguous - to exchange those data items we either have to send many small messages or copy the small fragments into a contiguous temporary buffer before sending (and reverse this process at the receiver side). Sending separate messages is inefficient on many networks, since they only reach their peak bandwidth for large messages. Therefore many scientific codes pack such small data fragments by iterating over them and performing an explicit copy into a temporary buffer before sending. The problem with this approach is that it is not performance portable - while on some systems the higher bandwidth of large message transfers outweighs the overhead of many small messages, this is not necessarily true for all machines to the same extent. Some modern interconnection networks and network APIs, such as Cray's DMAPP, offer hardware support for strided data transfers.
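To make the manual-packing approach concrete, here is a minimal sketch in C of packing and unpacking one border column of a row-major N x N block. The helper names (`pack_column`, `unpack_column`) are ours, chosen for illustration; real application codes typically inline such loops directly.

```c
#include <assert.h>
#include <stddef.h>

/* Pack one column of an n x n row-major block into a contiguous buffer.
 * The column at index "col" is strided: consecutive elements are n
 * doubles apart in memory, so we perform one strided load per element. */
static void pack_column(const double *block, size_t n, size_t col,
                        double *buf) {
    for (size_t row = 0; row < n; ++row)
        buf[row] = block[row * n + col];
}

/* Reverse step at the receiver: scatter the contiguous buffer back
 * into a halo column of the local block. */
static void unpack_column(const double *buf, size_t n, size_t col,
                          double *block) {
    for (size_t row = 0; row < n; ++row)
        block[row * n + col] = buf[row];
}
```

With this scheme the application owns the temporary buffer and has to keep it alive until the communication completes, which is exactly the bookkeeping burden mentioned above.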
The MPI standard offers an interface to express such non-contiguous data access patterns: MPI Derived Datatypes (DDTs). Using DDTs, the programmer can specify any data layout and pass this description to an MPI send or receive function. The data layout in the stencil example can be described with a vector datatype, for example, as shown in the illustration above. Ideally, this would lead to performance-portable code, as the MPI library would choose the optimal method of packing and sending the data on the available hardware. Even a "lazy" MPI implementation, which does not exploit such advanced hardware features and just packs the data into an internal temporary buffer, should not be slower than manual packing, but it reduces code complexity: the programmer does not need to manage the temporary pack/unpack buffers (which can be quite involved if non-blocking sends and receives are used) and the actual packing code can be removed completely. Unfortunately, reality is very different! In the illustration below we compare the performance of MPI_Pack to manual packing code from two important scientific applications (MILC and WRF) and the NAS LU benchmark, using Cray's vendor MPI.
A comparison of manual packing performance with packing of MPI Derived Datatypes
As shown, manual packing is always faster, and the difference in performance is large, up to a factor of 10 for small data. Why is MPI so slow compared with manual packing? The fundamental difference between the two approaches is that (in all MPI implementations we know of) MPI Derived Datatypes are interpreted at runtime: each pack operation walks the datatype description, while a manual pack loop is compiled and optimized ahead of time. Our libpack library closes this gap by compiling a specialized pack function for each datatype at runtime, as reflected in the comparison below.
A comparison of manual packing performance with packing of MPI Derived Datatypes with Cray MPI and libpack
Our libpack library can either be used directly as a C++ library or through the supplied MPI interposer, which captures all calls to the MPI_Type_create and MPI_Send/MPI_Recv functions, internally calls the libpack functions as appropriate, and uses the PMPI interface to send/receive the packed data. The libpack library is still under development; the latest versions can always be downloaded from github. A tarball is also available here.