Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Publications of SPCL

A. Strausz, F. Vella, S. Di Girolamo, M. Besta, T. Hoefler:

 Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching

(In Proceedings of the 36th IEEE Interational Parallel and Distributed Processing Symposium (to appear), Jun. 2022)

Publisher Reference


Triangle count and local clustering coefficient are two core metrics for graph analysis. They find broad application in analyses such as community detection and link recommendation. Current state-of-the-art solutions suffer from synchronization overheads or expensive pre-computations needed to distribute the graph, achieving limited scaling capabilities. We propose a fully asychronous implementation for triangle counting and local clustering coefficient based on 1D partitioning, using remote memory accesses for transferring data and avoid synchronization. Additionally, we show how these algorithms present data reuse on remote memory accesses and how the overall communication time can be improved by caching these accesses. Finally, we extend CLaMPI, a software-layer caching system for MPI RMA, to include application-specific scores for cached entries and influence the eviction procedure to improve caching efficiency. Our results show improvements on shared memory, and we achieve 14x speedup from 4 to 64 nodes for the LiveJournal1 graph on distributed memory. Moreover, we demonstrate how caching remote accesses reduces total running time by up to 73% with respect to a non-cached version. Finally, we compare our implementation to TriC, the 2020 graph champion paper, and achieve up to 100x faster results for scale-free graphs.


access preprint on arxiv:
download slides:

Recorded talk (best effort)



  author={AndrĂ¡s Strausz and Flavio Vella and Salvatore Di Girolamo and Maciej Besta and Torsten Hoefler},
  title={{Asynchronous Distributed-Memory Triangle Counting and LCC with RMA Caching}},
  booktitle={Proceedings of the 36th IEEE Interational Parallel and Distributed Processing Symposium (to appear)},