SPCL_Bcast(COMM_WORLD)
Archive
What: SPCL_Bcast is an open, online seminar series that covers a broad range of topics in parallel and high-performance computing, scalable machine learning, and related areas.
Who: We invite top researchers and engineers from all over the world to speak.
These are past SPCL_Bcast talks, including recordings. For current talks, go here.
Fugaku: The First 'Exascale' Supercomputer – Past, Present and Future
Abstract: Fugaku is the first 'exascale' supercomputer in the world, not because of its peak double-precision flops, but because of its demonstrated performance on the real applications that exascale machines were expected to run when they were conceived 10 years ago, as well as for reaching actual exaflops on new breeds of benchmarks such as HPL-AI. But the importance of Fugaku lies in the "applications first" philosophy under which it was developed, and in its resulting mission to be the centerpiece for the rapid realization of the so-called Japanese 'Society 5.0', as defined by the Japanese S&T national policy. As such, Fugaku's immense power is directly applicable not only to traditional scientific simulation applications, but also to Society 5.0 applications that encompass the convergence of HPC, AI, and Big Data, as well as of cyber (IDC & network) and physical (IoT) space, with immediate societal impact. In fact, Fugaku is already in partial operation a year ahead of schedule, primarily to obtain early Society 5.0 results, including combating COVID-19 and resolving other important societal issues. The talk will introduce how Fugaku was conceived, analyzed, and built over the 10-year period, look at current efforts regarding Society 5.0 and COVID, and touch upon our thoughts on the next-generation machine, or "Fugaku NeXT".
Homepage
A Paradigm Shift to Second Order Methods for Machine Learning
Abstract: The amount of compute needed to train modern NN architectures has been doubling every few months. With this trend, it is no longer possible to perform brute-force hyperparameter tuning to train the model to good accuracy. Yet first-order methods such as Stochastic Gradient Descent are quite sensitive to such hyperparameter tuning and can easily diverge on challenging problems. Many of these problems can be addressed with second-order optimizers. In this direction, we introduce AdaHessian, a new stochastic optimization algorithm. AdaHessian directly incorporates approximate curvature information from the loss function, and it includes several novel performance-improving features, including: (i) a fast Hutchinson-based method to approximate the curvature matrix with low computational overhead; and (ii) a spatial/temporal block-diagonal averaging to smooth out variations of the second derivative over different parameters/iterations. Extensive tests on NLP, CV, and recommendation-system tasks show that AdaHessian achieves state-of-the-art results, with 10x lower sensitivity to hyperparameter tuning compared to Adam.
In particular, we find that AdaHessian:
(i) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14, 2.7/1.0 PPL on PTB/Wikitext-103;
(ii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE;
(iii) achieves 1.45%/5.55% higher accuracy on ResNet32/ResNet18 on CIFAR-10/ImageNet compared to Adam; and
(iv) achieves 0.032% better score than AdaGrad for DLRM on the Criteo Ad Kaggle dataset.
The cost per iteration of AdaHessian is comparable to first-order methods, and AdaHessian exhibits improved robustness towards variations in hyperparameter values.
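To make the Hutchinson step above concrete, here is a minimal PyTorch sketch of estimating the Hessian diagonal with Rademacher probes and Hessian-vector products. It is an illustration under our own naming, not the authors' AdaHessian code; `hutchinson_diag_hessian` and its defaults are hypothetical.

```python
# Minimal sketch (not the official AdaHessian implementation): estimate
# diag(H) as E[z * (H z)] using Rademacher vectors z and double backprop.
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    # First backward pass with create_graph=True so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag_est = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probes: entries are -1 or +1 with equal probability.
        zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector products H z via a second backward pass.
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs,
                                   retain_graph=True)
        for d, z, hvp in zip(diag_est, zs, hvps):
            d += z * hvp / n_samples  # running average of z * (H z)
    return diag_est
```

For Rademacher probes, E[z ⊙ (Hz)] equals the Hessian diagonal (off-diagonal terms cancel in expectation), so a few probes yield a curvature estimate at the cost of a handful of Hessian-vector products, which is what keeps the per-iteration cost comparable to first-order methods.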
Homepage
Distributed Deep Learning with Second Order Information
Abstract: As the scale of deep neural networks continues to increase exponentially, distributed training is becoming an essential tool in deep learning. Especially in the context of un/semi/self-supervised pretraining, larger models tend to achieve much higher accuracy. This trend is particularly clear in natural language processing, where the latest GPT-3 model has 175 billion parameters. Training such models requires hybrid data+model parallelism. In this talk, I will describe two of our recent efforts in the context of large-scale distributed deep learning: (1) second-order optimization and (2) reducing the memory footprint.
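As a hedged illustration of the data-parallel half of such hybrid training (the helper below is our own sketch, not code from the talk), each worker computes gradients on its data shard and the gradients are then averaged with an all-reduce:

```python
# Minimal sketch of data-parallel gradient averaging with mpi4py.
# Illustrative only; hybrid data+model parallelism additionally shards
# the model itself across ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def average_gradients(grads):
    """In-place all-reduce average of per-parameter numpy gradients."""
    for g in grads:
        comm.Allreduce(MPI.IN_PLACE, g, op=MPI.SUM)  # sum across workers
        g /= comm.Get_size()                         # divide by world size
    return grads
```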
Homepage
High Performance Tensor Computations
Abstract: Tensor decompositions, contractions, and tensor networks are prevalent in applications ranging from data modeling to simulation of quantum systems. Numerical kernels within these methods present challenges associated with sparsity, symmetry, and other types of tensor structure. We describe recent innovations in algorithms for tensor contractions and tensor decompositions, which minimize costs and improve scalability. Further, we highlight new libraries for (1) automatic differentiation in the context of high-order tensor optimization, (2) efficient tensor decomposition, and (3) tensor network state simulation. These libraries all build on distributed tensor contraction kernels for sparse and dense tensors provided by the Cyclops library, enabling a shared ecosystem for applications of tensor computations.
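As a small, self-contained illustration of the kind of contraction kernel such libraries optimize, the following computes the MTTKRP (matricized tensor times Khatri-Rao product), the core kernel of CP decomposition, with numpy's einsum; the sizes are arbitrary, and Cyclops' Python interface offers an analogous einsum-style API for distributed sparse and dense tensors.

```python
# Sketch: mode-1 MTTKRP, the dominant kernel in CP (CANDECOMP/PARAFAC)
# decomposition by alternating least squares.
import numpy as np

I, J, K, R = 30, 40, 50, 8          # arbitrary tensor and rank sizes
X = np.random.rand(I, J, K)         # dense 3rd-order tensor
B = np.random.rand(J, R)            # mode-2 factor matrix
C = np.random.rand(K, R)            # mode-3 factor matrix

# M[i, r] = sum_{j,k} X[i, j, k] * B[j, r] * C[k, r]
M = np.einsum('ijk,jr,kr->ir', X, B, C)
```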
Homepage
Light-Weight Performance Analysis for Next-Generation HPC Systems
Abstract: Building efficient and scalable performance analysis and optimization tools for large-scale systems is increasingly important, both for the developers of parallel applications and for the designers of next-generation HPC systems. However, conventional performance tools suffer from significant time/space overhead due to the ever-increasing problem size and system scale. On the other hand, the cost of source code analysis is independent of the problem size and system scale, making it very appealing for large-scale performance analysis. Inspired by this observation, we have designed a series of light-weight performance tools for HPC systems, covering memory access monitoring, performance variance detection, and communication compression. In this talk, I will share our experience in building these tools by combining static and runtime analysis, and point out the main challenges in this direction.
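The speaker's tools are not spelled out in the abstract, but the "light-weight" principle can be sketched generically: aggregate per call site instead of logging every event, so overhead stays roughly constant as the problem scales. The decorator below is a hypothetical illustration, not one of the tools from the talk.

```python
# Generic sketch of light-weight runtime aggregation: O(1) state per call
# site (count and total time) rather than a per-event trace.
import time
from collections import defaultdict
from functools import wraps

stats = defaultdict(lambda: [0, 0.0])  # name -> [call count, total seconds]

def monitor(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            s = stats[fn.__qualname__]
            s[0] += 1
            s[1] += time.perf_counter() - t0
    return wrapper
```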
Homepage
We encourage all Bcast attendees to attend this year's virtual Supercomputing conference!
Decomposing MPI Collectives for Exploiting Multi-lane Communication
Abstract: Many modern, high-performance systems increase their aggregate node bandwidth by offering more than a single communication network and/or by having multiple connections to the network, such that a single processor core cannot by itself saturate the off-node bandwidth. Efficient algorithms and implementations of collective operations, as found in, e.g., MPI, must be explicitly designed to exploit such multi-lane capabilities. We are interested in gauging the extent to which current implementations do so.
In the talk, I will illustrate how we systematically decompose the MPI collectives into similar operations that can execute concurrently on, and thereby exploit, multiple network lanes. Our decomposition is applicable to all standard, regular MPI collectives, and the performance of our implementations can be readily compared to the native collectives of any given MPI library. Contrary to expectation, our full-lane, performance-guideline implementations in many cases show surprising performance improvements with different MPI libraries on different systems, indicating severe problems with native MPI library implementations. In many cases, our full-lane implementations are faster than the corresponding library MPI collectives by large factors. The results indicate considerable room for improvement in the MPI collectives of current MPI libraries, including more efficient use of multi-lane capabilities.
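As a hedged sketch of the decomposition idea (not the authors' implementation, and mapping communicators to physical network lanes is system- and library-specific), a broadcast can be split into concurrent broadcasts of disjoint chunks over duplicated communicators:

```python
# Sketch: split one MPI broadcast into nlanes concurrent broadcasts over
# duplicated communicators, one per "lane". Whether the duplicates actually
# map to distinct network rails depends on the MPI library and system.
from mpi4py import MPI
import numpy as np

def multilane_bcast(buf, root, comm, nlanes=2):
    lanes = [comm.Dup() for _ in range(nlanes)]    # one communicator per lane
    chunks = np.array_split(buf, nlanes)           # contiguous views into buf
    reqs = [lane.Ibcast(chunk, root=root)          # start all lanes at once
            for lane, chunk in zip(lanes, chunks)]
    MPI.Request.Waitall(reqs)                      # complete all lanes
    for lane in lanes:
        lane.Free()
```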
Homepage
Large Graph Processing on Heterogeneous Architectures: Systems, Applications and Beyond
Abstract: Graphs are the de facto data structures for many data processing applications, and their volume is ever growing. Many graph processing tasks are computation- and/or memory-intensive. Therefore, we have witnessed a significant amount of effort in accelerating graph processing tasks with heterogeneous architectures such as GPUs, FPGAs, and even ASICs. In this talk, we will first review the literature on large graph processing systems on heterogeneous architectures. Next, we present our research efforts and demonstrate the significant performance impact of hardware-software co-design on building high-performance graph computation systems and applications. Finally, we outline a research agenda on the challenges and opportunities in system and application development for future graph processing.
Homepage