TinyMPI: A Virtualized MPI Implementation for Arm and x86
Motivation
Traditional MPI implementations run at most one MPI rank per CPU core. TinyMPI runs more than one MPI rank per CPU core, i.e., it oversubscribes the CPU, with the goal of achieving automatic computation-communication overlap: when one MPI rank blocks, TinyMPI switches to another and continues using the CPU.
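The switching idea can be illustrated with a toy scheduler (a conceptual Python sketch of rank virtualization, not TinyMPI's actual implementation): each simulated rank yields control whenever it would block on communication, and a round-robin scheduler resumes another rank so the core never idles.

```python
# Toy model of rank virtualization: each "rank" is a generator that
# yields whenever it would block on communication, and a round-robin
# scheduler switches to another runnable rank on the same core.
from collections import deque

def rank(rank_id, n_phases):
    for phase in range(n_phases):
        # ... a compute phase would run here ...
        # The rank now "blocks" on communication: yield to the scheduler.
        yield (rank_id, phase)

def run_virtualized(v, n_phases=2):
    """Run v ranks on one simulated core, recording the execution order."""
    ready = deque(rank(r, n_phases) for r in range(v))
    trace = []
    while ready:
        r = ready.popleft()
        try:
            trace.append(next(r))  # resume this rank until it blocks again
            ready.append(r)        # blocked rank goes to the back of the queue
        except StopIteration:
            pass                   # rank finished all its phases
    return trace

# With two ranks per core, work interleaves instead of the core idling:
print(run_virtualized(2))  # → [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The trace shows the core alternating between ranks at every would-be blocking point, which is exactly the overlap virtualization aims for.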
We have used TinyMPI as a research tool to answer the question, "Exactly how many ranks should be started on each CPU core?" The results, in the form of a performance model, are presented here.
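A back-of-the-envelope version of this question (a simplification we assume for illustration; the actual model is described in the project report): if each rank alternates t_compute of useful work with t_wait of blocked communication, and each task switch costs t_switch, then the wait is hidden once the other V - 1 ranks collectively keep the core busy for t_wait, giving V >= 1 + t_wait / (t_compute + t_switch).

```python
import math

def min_virtualization_ratio(t_compute, t_wait, t_switch):
    """Smallest integer V with (V - 1) * (t_compute + t_switch) >= t_wait.

    Hypothetical bound, not the project's actual model: while one rank
    blocks for t_wait, the remaining V - 1 ranks must cover that wait
    with their compute phases plus switch overhead.
    """
    return 1 + math.ceil(t_wait / (t_compute + t_switch))

# E.g. 1 ms compute phases, 3 ms communication waits, 0.01 ms switch cost:
print(min_virtualization_ratio(1.0, 3.0, 0.01))  # → 4
```

The bound also captures both failure modes: with t_wait = 0 it returns V = 1 (no virtualization needed), and a large t_switch drives the useful part of each covering slice down, so ever-larger V stops paying off.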
Performance Model
The number of MPI ranks that a virtualized MPI implementation launches per CPU core is referred to as the virtualization ratio; we denote it V. Intuitively, V plays an important role in overall performance: if V = 1, no virtualization benefits can be seen at all; if V is too large, the overhead of task switching and the increased volume of intra-node communication outweigh the benefits. The project report contains a full description of our performance model.

Implementation
Version | Date | Changes
tinyMPI_1.0.tar.gz | Feb 18, 2019 | initial release
References

[1] A. Nigay, T. Schneider, T. Hoefler: TinyMPI tasking prototype. Feb. 2019.