SPCL_Bcast(COMM_WORLD) Archive



What: SPCL_Bcast is an open, online seminar series that covers a broad range of topics around parallel and high-performance computing, scalable machine learning, and related areas.


Who: We invite top researchers and engineers from all over the world to speak.


These are past SPCL_Bcast talks, including recordings. For current talks, go here.


10 September, 2020 — Satoshi Matsuoka (RIKEN R-CCS)
9 AM Zurich, 4 PM Tokyo, 3 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco

Fugaku: The First 'Exascale' Supercomputer – Past, Present and Future

Abstract: Fugaku is the first 'exascale' supercomputer of the world, not due to its peak double-precision flops, but rather its demonstrated performance on the real applications that were expected of exascale machines when they were conceived 10 years ago, as well as reaching actual exaflops in new breeds of benchmarks such as HPL-AI. But the importance of Fugaku lies in the "applications first" philosophy under which it was developed, and its resulting mission to be the centerpiece for the rapid realization of the so-called Japanese 'Society 5.0', as defined by the Japanese S&T national policy. As such, Fugaku's immense power is directly applicable not only to traditional scientific simulation applications, but also to Society 5.0 applications that encompass the convergence of HPC, AI, and Big Data, as well as Cyber (IDC & network) and Physical (IoT) space, with immediate societal impact. In fact, Fugaku is already in partial operation a year ahead of schedule, primarily to obtain early Society 5.0 results, including combating COVID-19 and resolving other important societal issues. The talk will introduce how Fugaku was conceived, analyzed, and built over a 10-year period, look at current efforts regarding Society 5.0 and COVID-19, and touch upon our thoughts on the next-generation machine, or "Fugaku NeXT".

Bio: Satoshi Matsuoka is the director of RIKEN R-CCS, the top-tier HPC center in Japan, which operates the K Computer and will host its successor, Supercomputer Fugaku; he has also been a Specially Appointed Professor at Tokyo Tech since 2018. He was a Full Professor at the Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology, from 2000, where he led the TSUBAME series of supercomputers, which won many accolades such as world #1 in power-efficient computing. He also leads various major supercomputing research projects in areas such as parallel algorithms and programming, resilience, green computing, and the convergence of Big Data/AI with HPC. He has written over 500 articles, chaired numerous ACM/IEEE conferences, and won many awards, including the ACM Gordon Bell Prize in 2011 and the highly prestigious 2014 IEEE-CS Sidney Fernbach Memorial Award.
Homepage


24 September, 2020 — Amir Gholami & Zhewei Yao (UC Berkeley)
6 PM Zurich, 1 AM (Friday) Tokyo, Midnight Beijing, 12 PM New York, 9 AM San Francisco

A Paradigm Shift to Second Order Methods for Machine Learning

Abstract: The amount of compute needed to train modern NN architectures has been doubling every few months. With this trend, it is no longer possible to perform brute-force hyperparameter tuning to train the model to good accuracy. Yet first-order methods such as Stochastic Gradient Descent are quite sensitive to such hyperparameter tuning and can easily diverge for challenging problems. Many of these problems can be addressed with second-order optimizers. In this direction, we introduce AdaHessian, a new stochastic optimization algorithm. AdaHessian directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including: (i) a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead; and (ii) spatial/temporal block-diagonal averaging to smooth out variations of the second derivative over different parameters/iterations. Extensive tests on NLP, CV, and recommendation-system tasks show that AdaHessian achieves state-of-the-art results, with 10x less sensitivity to hyperparameter tuning compared to Adam.
In particular, we find that AdaHessian:
(i) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103;
(ii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE;
(iii) achieves 1.45%/5.55% higher accuracy on ResNet32/ResNet18 on CIFAR-10/ImageNet compared to Adam; and
(iv) achieves a 0.032% better score than AdaGrad for DLRM on the Criteo Ad Kaggle dataset.
The cost per iteration of AdaHessian is comparable to first-order methods, and AdaHessian exhibits improved robustness towards variations in hyperparameter values.
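At the core of AdaHessian is Hutchinson's trick: for a random Rademacher vector z (entries ±1), the expectation of z ⊙ (Hz) equals the diagonal of the Hessian H, so the diagonal can be estimated from Hessian-vector products alone. The following is a minimal PyTorch sketch of that estimator, not the authors' implementation; the function name and the toy quadratic loss are ours:

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=10):
    """Estimate diag(H) via Hutchinson's method: E[z * (H z)] = diag(H)
    for Rademacher vectors z (entries +/-1 with equal probability)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product: differentiate sum_i <grad_i, z_i> w.r.t. params
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for est, z, hvp in zip(estimate, zs, hvps):
            est += z * hvp / n_samples
    return estimate

# Toy check on a quadratic: diag(H) of sum(w^2) + w0*w1 is exactly [2, 2, 2, 2, 2];
# the w0*w1 cross term contributes only off-diagonal curvature (sampling noise aside).
w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum() + w[0] * w[1]
print(hutchinson_diag_hessian(loss, [w], n_samples=200)[0])
```

AdaHessian then applies its spatial/temporal block-diagonal averaging on top of such estimates before using them to precondition the update.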

Bio: Amir Gholami is a senior research fellow at ICSI and Berkeley AI Research (BAIR). He received his PhD from UT Austin, working on large-scale 3D image segmentation, a research topic which received UT Austin's best doctoral dissertation award in 2018. He is a Melosh Medal finalist, the recipient of the best student paper award at SC'17 and a Gold Medal in the ACM Student Research Competition, and a best student paper finalist at SC'14. Amir is a recognized expert in industry with long-lasting contributions. He was part of the Nvidia team that first made FP16 training possible, enabling a more than 10x increase in compute power through tensor cores; that technology is now widely adopted in GPUs. Amir's current research focuses on exascale neural network training and efficient inference.
Homepage
Bio: Zhewei Yao is a Ph.D. student in BAIR, RISELab (formerly AMPLab), BDD, and the Math Department at the University of California, Berkeley. He is advised by Michael Mahoney, and he also works very closely with Kurt Keutzer. His research interests lie in computational statistics, optimization, and machine learning. Currently, he is interested in leveraging tools from randomized linear algebra to provide efficient and scalable solutions for large-scale optimization and learning problems. He is also working on the theory and application of deep learning. Before joining UC Berkeley, he received his B.S. in Math from Zhiyuan Honor College at Shanghai Jiao Tong University.
Homepage


8 October, 2020 — Rio Yokota (Tokyo Tech)
9 AM Zurich, 4 PM Tokyo, 3 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco

Distributed Deep Learning with Second Order Information

Abstract: As the scale of deep neural networks continues to increase exponentially, distributed training is becoming an essential tool in deep learning. Especially in the context of un/semi/self-supervised pretraining, larger models tend to achieve much higher accuracy. This trend is especially clear in natural language processing, where the latest GPT-3 model has 175 billion parameters. Training such models requires hybrid data+model parallelism. In this talk, I will describe two of our recent efforts in the context of large-scale distributed deep learning: 1) second-order optimization and 2) reducing the memory footprint.
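The pressure toward model parallelism is easy to see from memory arithmetic alone. A back-of-the-envelope sketch (our illustrative numbers, not figures from the talk), assuming fp16 weights and a standard mixed-precision Adam setup:

```python
params = 175e9             # GPT-3 parameter count
weights_fp16 = params * 2  # 2 bytes per fp16 parameter
# Mixed-precision Adam typically also keeps fp32 master weights plus two
# fp32 moment buffers: 4 + 4 + 4 = 12 extra bytes per parameter.
optimizer_state = params * 12

print(f"weights alone:        {weights_fp16 / 2**30:7.0f} GiB")                     # ~326 GiB
print(f"plus optimizer state: {(weights_fp16 + optimizer_state) / 2**30:7.0f} GiB") # ~2282 GiB
```

Since no single accelerator holds hundreds of GiB, the model itself must be partitioned across devices (model parallelism), with data parallelism layered on top for throughput.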

Bio: Rio Yokota is an Associate Professor at the Tokyo Institute of Technology. His research interests lie at the intersection of HPC and ML. On the HPC side, he has worked on hierarchical low-rank approximation methods such as FMM and H-matrices. He has worked on GPU computing since 2007 and won the Gordon Bell Prize using the first GPU supercomputer in 2009. On the ML side, he works on distributed deep learning and second-order optimization. His work on training ImageNet in 2 minutes with second-order methods has been extended to various applications using second-order information.
Homepage


22 October, 2020 — Edgar Solomonik (UIUC)
6 PM Zurich, 1 AM (Friday) Tokyo, Midnight Beijing, 12 PM New York, 9 AM San Francisco

High Performance Tensor Computations

Abstract: Tensor decompositions, contractions, and tensor networks are prevalent in applications ranging from data modeling to simulation of quantum systems. Numerical kernels within these methods present challenges associated with sparsity, symmetry, and other types of tensor structure. We describe recent innovations in algorithms for tensor contractions and tensor decompositions, which minimize costs and improve scalability. Further, we highlight new libraries for (1) automatic differentiation in the context of high-order tensor optimization, (2) efficient tensor decomposition, and (3) tensor network state simulation. These libraries all build on distributed tensor contraction kernels for sparse and dense tensors provided by the Cyclops library, enabling a shared ecosystem for applications of tensor computations.
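As a concrete example of the kernels such libraries target, the sketch below computes an MTTKRP (matricized-tensor-times-Khatri-Rao-product) contraction and one alternating-least-squares update for CP decomposition. This is our own minimal dense numpy illustration, not code from the talk; Cyclops' Python wrapper (import ctf) exposes a similar numpy-style einsum that executes the same expression as a distributed contraction:

```python
import numpy as np

I, J, K, R = 20, 30, 40, 5
T = np.random.rand(I, J, K)   # dense 3-way tensor to decompose
B = np.random.rand(J, R)      # current factor matrices
C = np.random.rand(K, R)

# MTTKRP: contract T against B and C, the dominant cost of CP-ALS.
M = np.einsum('ijk,jr,kr->ir', T, B, C)

# ALS update for factor A: solve against the Hadamard product of Gram matrices.
G = (B.T @ B) * (C.T @ C)
A = M @ np.linalg.pinv(G)
print(A.shape)                # (I, R)
```

Structured variants of this contraction (sparse T, symmetry, tensor-network topologies) are exactly where the algorithmic innovations described in the talk come into play.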

Bio: Edgar Solomonik is an assistant professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He was previously an ETH Zurich Postdoctoral Fellow and did his PhD at the University of California, Berkeley. He has received the DOE Computational Science Graduate Fellowship, the ACM/IEEE-CS George Michael Memorial HPC Fellowship, the David J. Sakrison Memorial Prize, the Alston S. Householder Prize, the IEEE-CS TCHPC Award for Excellence for Early Career Researchers in High Performance Computing, the SIAM Activity Group on Supercomputing Early Career Prize, and an NSF CAREER award. His research focuses on high-performance numerical linear algebra, tensor computations, and parallel algorithms.
Homepage


5 November, 2020 — Jidong Zhai (Tsinghua University)
9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco

Light-Weight Performance Analysis for Next-Generation HPC Systems

Abstract: Building efficient and scalable performance analysis and optimization tools for large-scale systems is increasingly important, both for the developers of parallel applications and for the designers of next-generation HPC systems. However, conventional performance tools suffer from significant time/space overhead due to the ever-increasing problem size and system scale. On the other hand, the cost of source code analysis is independent of the problem size and system scale, making it very appealing for large-scale performance analysis. Inspired by this observation, we have designed a series of light-weight performance tools for HPC systems, covering memory access monitoring, performance variance detection, and communication compression. In this talk, I will share our experience in building these tools by combining static analysis and runtime analysis, and also point out the main challenges in this direction.

Bio: Jidong Zhai is a Tenured Associate Professor in the Computer Science Department of Tsinghua University. He is a Siebel Scholar and a recipient of the CCF Outstanding Doctoral Dissertation Award, the IEEE TPDS Award for Editorial Excellence, and an NSFC Young Career Award. He was a Visiting Professor at Stanford University (2015-2016) and a Visiting Scholar at MSRA (Microsoft Research Asia) in 2013. His research interests include parallel computing, performance evaluation, compiler optimization, and heterogeneous computing. He has published more than 50 papers in prestigious refereed conferences and top journals, including SC, PPoPP, ASPLOS, ICS, ATC, MICRO, NSDI, IEEE TPDS, and IEEE TC. His research was a Best Paper Finalist at SC14. He is the advisor of the Tsinghua Student Cluster Team, which has won 9 international championships in student supercomputing challenges at SC, ISC, and ASC; in 2015 and 2018, the team swept the championships at all three. He was a program co-chair of NPC 2018 and of the ICPP PASA 2015 workshop. He has served or is serving as a TPC member of SC, ICS, PPoPP, IPDPS, ICPP, NAS, LCPC, and Euro-Par. He is the general secretary of ACM SIGHPC China and is currently on the editorial boards of IEEE Transactions on Parallel and Distributed Systems (TPDS), IEEE Transactions on Cloud Computing (TCC), and the Journal of Parallel and Distributed Computing.
Homepage


19 November, 2020 — Attend SC20

We encourage all Bcast attendees to attend this year's virtual Supercomputing conference!



3 December, 2020 — Jesper Larsson Träff (TU Wien)
9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco

Decomposing MPI Collectives for Exploiting Multi-lane Communication

Abstract: Many modern, high-performance systems increase the aggregate node bandwidth by offering more than a single communication network and/or by having multiple connections to the network, such that a single processor-core cannot by itself saturate the off-node bandwidth. Efficient algorithms and implementations for collective operations, as found in, e.g., MPI, must be explicitly designed to exploit such multi-lane capabilities. We are interested in gauging to what extent this is the case.
In the talk, I will illustrate how we systematically decompose the MPI collectives into similar operations that can execute concurrently on, and thereby exploit, multiple network lanes. Our decomposition is applicable to all standard, regular MPI collectives, and the performance of our implementations can be readily compared to the native collectives of any given MPI library. Contrary to expectation, our full-lane, performance-guideline implementations in many cases show surprising performance improvements with different MPI libraries on different systems, indicating severe problems with native MPI library implementations; in many cases, our full-lane implementations are faster than the corresponding library collectives by large factors. The results indicate considerable room for improvement in the MPI collectives of current MPI libraries, including a more efficient use of multi-lane capabilities.
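The decomposition idea can be sketched in a few lines of mpi4py: split the buffer into per-lane chunks and issue one non-blocking collective per lane on duplicated communicators. This is only a schematic illustration under our own assumptions (lane count, chunking, and plain Dup'd communicators), not the implementation from the talk; real multi-lane implementations also map lanes onto distinct cores and network ports:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
NUM_LANES = 4

# One duplicated communicator per lane, so the per-lane collectives can
# progress independently. A plain Dup models only the decomposition; a
# real multi-lane setup also pins each lane to its own core/NIC.
lanes = [comm.Dup() for _ in range(NUM_LANES)]

n = 1 << 20
sendbuf = np.random.rand(n)
recvbuf = np.empty_like(sendbuf)

# Decompose one large allreduce into NUM_LANES concurrent chunk-wise
# allreduces, then wait for all lanes to complete.
chunk = n // NUM_LANES
reqs = []
for i, lane in enumerate(lanes):
    s = slice(i * chunk, (i + 1) * chunk if i < NUM_LANES - 1 else n)
    reqs.append(lane.Iallreduce(sendbuf[s], recvbuf[s], op=MPI.SUM))
MPI.Request.Waitall(reqs)
```

The same chunk-and-run-concurrently pattern applies to the other regular collectives (bcast, gather, alltoall), which is what makes the decomposition systematic.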

Bio: Jesper Larsson Träff has been professor for Parallel Computing at TU Wien (Vienna University of Technology) since 2011. From 2010 to 2011 he was guest professor for Scientific Computing at the University of Vienna. From 1998 until 2010 he worked at the NEC Laboratories Europe in Sankt Augustin, Germany, on efficient implementations of MPI for NEC vector supercomputers; this work led to a doctorate (Dr. Scient.) from the University of Copenhagen in 2009. From 1995 to 1998 he spent four years as a PostDoc/Research Associate in the Algorithms Group of the Max-Planck Institute for Computer Science in Saarbrücken and the Efficient Algorithms Group at the Technical University of Munich. He received an M.Sc. in computer science in 1989 and, after two interim years at the industrial research center ECRC in Munich, a Ph.D. in 1995, both from the University of Copenhagen.
Homepage


17 December, 2020 — Bingsheng He (NUS)
9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco

Large Graph Processing on Heterogeneous Architectures: Systems, Applications and Beyond

Abstract: Graphs are the de facto data structures for many data processing applications, and their volume is ever growing. Many graph processing tasks are computation- and/or memory-intensive. Therefore, we have witnessed a significant amount of effort in accelerating graph processing tasks with heterogeneous architectures like GPUs, FPGAs, and even ASICs. In this talk, we will first review the literature on large graph processing systems on heterogeneous architectures. Next, we present our research efforts and demonstrate the significant performance impact of hardware-software co-design on high-performance graph computation systems and applications. Finally, we outline a research agenda on the challenges and opportunities in the system and application development of future graph processing.

Bio: Dr. Bingsheng He is currently an Associate Professor and Vice-Dean (Research) at the School of Computing, National University of Singapore. Before that, he was a faculty member at Nanyang Technological University, Singapore (2010-2016), and held a research position in the System Research group of Microsoft Research Asia (2008-2010), where his major work was building high-performance cloud computing systems for Microsoft. He received his Bachelor's degree from Shanghai Jiao Tong University (1999-2003) and his Ph.D. from the Hong Kong University of Science & Technology (2003-2008). His current research interests include cloud computing, database systems, and high-performance computing. His papers have been published in prestigious international journals (such as ACM TODS and IEEE TKDE/TPDS/TC) and proceedings (such as ACM SIGMOD, VLDB/PVLDB, ACM/IEEE Supercomputing, ACM HPDC, and ACM SoCC). He was awarded the IBM Ph.D. Fellowship (2007-2008) and an NVIDIA Academic Partnership (2010-2011). Since 2010, he has (co-)chaired a number of international conferences and workshops, including IEEE CloudCom 2014/2015, BigData Congress 2018, and ICDCS 2020. He has served on the editorial boards of international journals, including IEEE Transactions on Cloud Computing (IEEE TCC), IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), IEEE Transactions on Knowledge and Data Engineering (TKDE), Springer Journal of Distributed and Parallel Databases (DAPD), and ACM Computing Surveys (CSUR). He received editorial excellence awards for his service on IEEE TCC and IEEE TPDS in 2019.
Homepage