Fail-In-Place Network Design

In our paper [1], we define network fail-in-place by distinguishing between "critical" and "non-critical" network component failures. A critical failure disconnects all paths between two hosts, whereas a non-critical failure disconnects only a subset of the paths between them. The fail-in-place strategy is to repair only critical failures and to continue operation by bypassing non-critical ones. In a practical study of artificial and existing real-world networks, we explore whether fail-in-place is a feasible approach for large-scale high-performance computing (HPC) and data center networks. We expect this strategy to change the future design process for HPC systems as well as the operation policies for their networks.
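The distinction between critical and non-critical failures can be illustrated with a small connectivity check. The following is a minimal sketch, not part of the toolchain: the function names, the toy network, and the restriction to single link failures are assumptions made for illustration. It labels a failed link as critical if any pair of hosts loses all of its paths, and as non-critical otherwise.

```python
from collections import deque


def connected(adj, src, dst, removed):
    """Return True if dst is reachable from src while the links in `removed` are down."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj[node]:
            if frozenset((node, nxt)) in removed or nxt in seen:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False


def classify_link_failure(links, hosts, failed_link):
    """Label a failed link 'critical' if it disconnects any host pair, else 'non-critical'.

    Assumes the fault-free network connects all hosts.
    """
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    removed = {frozenset(failed_link)}
    for i, src in enumerate(hosts):
        for dst in hosts[i + 1:]:
            if not connected(adj, src, dst, removed):
                return "critical"
    return "non-critical"


# Hypothetical toy network: hosts h1 and h2 attached to a ring of switches s1..s4.
links = [("h1", "s1"), ("h2", "s3"),
         ("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s1")]
hosts = ["h1", "h2"]

print(classify_link_failure(links, hosts, ("s1", "s2")))  # ring link, detour via s4 exists
print(classify_link_failure(links, hosts, ("h1", "s1")))  # only link attaching h1
```

In this toy network, the ring link can be bypassed and would be left in place, while the loss of a host's only uplink would have to be repaired.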

Downloading the toolchain
fts.tgz

Installation
./simulate.py --setup
creates patched and optimized versions of ibsim/opensm/... under $HOME/simulation

Build simulation experiments
./simulate.py --build
creates faulty networks under $HOME/simulation/experiments

Start individual experiments
e.g., run $HOME/simulation/experiments/2d_mesh_s25_n256_l240_rs1/lnf/12/2d_mesh_s25_n256_l240_rs1_lnf_12_dfsssp_exchange.sh
simulates an MPI all-to-all on a 2D mesh with 12 link failures and DFSSSP routing
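To see which experiments the build step produced, the generated scripts can be listed per directory. This is a rough sketch, assuming only what the example path above shows: that experiment scripts end in .sh and live somewhere below $HOME/simulation/experiments. The helper scripts_by_directory is hypothetical and not part of the toolchain.

```python
from pathlib import Path


def scripts_by_directory(root):
    """Map each subdirectory of `root` (as a relative path) to the *.sh files it contains."""
    root = Path(root).expanduser()
    groups = {}
    if not root.is_dir():
        # Nothing has been built yet, or the path is wrong.
        return groups
    for script in sorted(root.rglob("*.sh")):
        key = script.parent.relative_to(root).as_posix()
        groups.setdefault(key, []).append(script.name)
    return groups


if __name__ == "__main__":
    # Default location taken from the build step above.
    for directory, names in scripts_by_directory("~/simulation/experiments").items():
        print(directory)
        for name in names:
            print("  " + name)
```

Each printed script can then be started individually as described above.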

The fail-in-place toolchain was developed by Jens Domke at the Matsuoka Laboratory at the Tokyo Institute of Technology, and the scientific work was advised by Torsten Hoefler.

References

SC14
[1] J. Domke, T. Hoefler, S. Matsuoka:
 Fail-in-Place Network Design: Interaction between Topology, Routing Algorithm and Failures presented in New Orleans, LA, USA, Nov. 2014, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC14) (acceptance rate: 21%, 82/394)