TinyMPI: A Virtualized MPI Implementation for Arm and x86

Motivation

Traditional MPI implementations run at most one MPI rank per CPU core. TinyMPI runs more than one MPI rank per CPU core, i.e., it oversubscribes the CPU, with the goal of achieving automatic computation-communication overlap: when one MPI rank blocks, TinyMPI switches to another and continues using the CPU.

We have used TinyMPI as a research tool to answer the question "Exactly how many ranks should be started on each CPU core?"; the results, in the form of a performance model, are presented here.

Performance Model

The number of MPI ranks that a virtualized MPI implementation launches per CPU core is called the virtualization ratio, which we denote by V. Intuitively, V plays an important role in overall performance: if V=1, no virtualization benefits can be seen at all; if V is too large, the overhead of task switching and the increased volume of intra-node communication outweigh the benefits. The project report contains a full description of our performance model.
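For illustration, launching an oversubscribed job amounts to starting V ranks per core. The following shell sketch computes the total rank count from these two quantities; the core count, the value of V, and the application name ./app are example values, and the actual TinyMPI launch command may differ:

```shell
# Sketch only: 8 cores and V=4 are illustrative values, not recommendations.
CORES=8            # physical cores on the node (example value)
V=4                # virtualization ratio: MPI ranks per core
NP=$((CORES * V))  # total number of MPI ranks to launch
echo "Launching $NP ranks on $CORES cores (V=$V)"
# mpirun -np "$NP" ./app   # hypothetical launch of a TinyMPI application
```

With a conventional MPI implementation the same node would run only CORES ranks; the extra factor of V is what gives TinyMPI a spare rank to switch to whenever one rank blocks.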

Implementation

Version             Date          Changes
tinyMPI_1.0.tar.gz  Feb 18, 2019  initial release

References

[1] A. Nigay, T. Schneider, T. Hoefler: TinyMPI tasking prototype. Feb. 2019.