SPCL_Bcast is an open, online seminar series that covers a broad range of topics around parallel and high-performance computing, scalable machine learning, and related areas.
Who: We invite top researchers and engineers from all over the world to speak.
Old talks: See the SPCL website for recordings of past talks.
Social media: Follow along with #spcl_bcast on Twitter!
When: Every two weeks on Thursdays, in one of two slots (depending on speaker).
14 January – 11 March, 2021:
- Morning: 9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco
- Evening: 6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 12 PM (noon) New York, 9 AM San Francisco
25 March, 2021:
- Morning: 9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 4 AM New York, 1 AM San Francisco
- Evening: 6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 1 PM New York, 10 AM San Francisco
8 April – 20 May, 2021:
- Morning: 9 AM Zurich, 4 PM Tokyo, 3 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco
- Evening: 6 PM Zurich, 1 AM (Friday) Tokyo, 12 AM (midnight) Beijing, 12 PM (noon) New York, 9 AM San Francisco
Enabling Rapid COVID-19 Small Molecule Drug Design Through Scalable Deep Learning of Generative Models
Abstract: We improved the quality and reduced the time to produce machine-learned models for use in small-molecule antiviral design. Our globally asynchronous, multi-level parallel training approach strong-scales to all of Sierra with up to 97.7% efficiency. We trained a novel, character-based Wasserstein autoencoder that produces a higher-quality model, trained on 1.613 billion compounds in 23 minutes, whereas the previous state of the art takes a day to train on 1 million compounds. Reducing training time from a day to minutes shifts the model-creation bottleneck from computer job turnaround time to human innovation time. Our implementation achieves 318 PFLOPS, 17.1% of half-precision peak. We will incorporate this model into our molecular design loop, enabling the generation of more diverse compounds: the search for novel candidate antiviral drugs improves, and the time to synthesize compounds for lab testing shrinks.
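The scaling substrate behind such training is easiest to see in stripped-down form. Below is a minimal synchronous data-parallel sketch using mpi4py; the talk's approach is asynchronous and multi-level, so this is only the simplest analogue, and local_gradient is a hypothetical stand-in for a real model's backward pass.

```python
# Minimal synchronous data-parallel training sketch (mpi4py + NumPy).
# The talk's scheme is asynchronous and multi-level; this only shows the
# core idea: each rank computes a gradient on its own data shard, and the
# ranks average gradients so every copy of the model stays in sync.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

def local_gradient(weights, batch):
    # Hypothetical toy gradient; a real model would backpropagate here.
    return batch.mean(axis=0) - weights

weights = np.zeros(8)
for step in range(100):
    batch = np.random.rand(32, 8)   # in practice: this rank's data shard
    grad = local_gradient(weights, batch)
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)   # sum gradients across ranks
    weights -= 0.01 * (avg / size)          # identical SGD step everywhere
```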
Optimizing CESM-HR on Sunway TaihuLight and An Unprecedented Set of Multi-Century Simulations
Abstract: CESM is one of the first and most complex scientific codes to be migrated onto Sunway TaihuLight. As a community code involving hundreds of different dynamics, physics, and chemistry processes, CESM poses severe challenges for the many-core architecture and the parallel scale of Sunway TaihuLight. This talk summarizes our continuous effort to enable efficient runs of CESM on Sunway, starting with the refactoring of CAM in 2015, its redesign in 2016 and 2017, and a collaborative effort beginning in 2018 to enable highly efficient simulations of the high-resolution (25 km atmosphere and 10 km ocean) Community Earth System Model (CESM-HR) on Sunway TaihuLight. These refactoring and optimization efforts have improved the simulation speed of CESM-HR from 1 SYPD (simulation years per day) to 5 SYPD (with output disabled). Using CESM-HR, we have produced an unprecedented set of high-resolution climate simulations, consisting of a 500-year pre-industrial control simulation and a 250-year historical and future climate simulation from 1850 to 2100. Overall, the high-resolution simulations show significant improvements in representing global mean temperature changes, the seasonal cycle of sea-surface temperature and mixed-layer depth, extreme events, and the relationships between extreme events and climate modes.
Evaluating modern programming models using the Parallel Research Kernels
Abstract: The Parallel Research Kernels were developed to support empirical studies of programming models in a variety of contexts without the porting effort required by proxy or mini-applications. I will describe the project, explain why it has been a useful tool in so many contexts, and present some of our findings related to modern C++ parallelism for CPU and GPU architectures.
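For a flavor of what a PRK is, here is the "nstream" streaming-kernel pattern rendered as a NumPy sketch; the actual kernels (github.com/ParRes/Kernels) ship tuned variants in many programming models, including the modern C++ ones discussed in the talk.

```python
# Sketch of the PRK "nstream" pattern: a STREAM-triad-like kernel whose
# performance is bounded by memory bandwidth, not arithmetic.
import time
import numpy as np

n, iterations, scalar = 10_000_000, 10, 3.0
A = np.zeros(n)
B = np.full(n, 2.0)
C = np.full(n, 2.0)

t0 = time.perf_counter()
for _ in range(iterations):
    A += B + scalar * C   # reads A, B, C and writes A each sweep
elapsed = time.perf_counter() - t0

# 4 eight-byte streams per element per iteration (NumPy temporaries add
# extra traffic in practice; tuned variants avoid them).
bytes_moved = 4 * 8 * n * iterations
print(f"effective bandwidth ~ {bytes_moved / elapsed / 1e9:.1f} GB/s")
```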
High-Performance Sparse Tensor Operations in HiParTI Library
Abstract: This talk will present the recent development of HiParTI, a Hierarchical Parallel Tensor Infrastructure. I will focus on element-wise sparse tensor contractions, which commonly appear in quantum chemistry, physics, and other domains. We introduce three optimization techniques: an efficient, multi-dimensional hashtable representation for the accumulator, the same representation for the larger input tensor, and all-stage parallelization. Evaluated on 15 datasets, we obtain 28-576x speedup over the traditional sparse tensor contraction. With our proposed algorithm- and memory-heterogeneity-aware data management, further performance improvement is achieved on heterogeneous memory combining DRAM and the Intel Optane DC Persistent Memory Module (PMM), compared with state-of-the-art solutions.
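The hashtable idea is easy to state in plain Python. The sketch below illustrates the concept only (COO tensors stored as dicts), not HiParTI's actual data structures or API:

```python
# Hedged sketch of an element-wise sparse tensor contraction with a
# hashtable accumulator. Tensors are COO dicts: {index_tuple: value}.
from collections import defaultdict

def contract(X, Y, modes_x, modes_y):
    """Contract modes_x of X with modes_y of Y; returns a COO dict."""
    # Bucket Y's nonzeros by their contracted indices (the hashtable
    # representation for the larger input tensor).
    buckets = defaultdict(list)
    for idx, val in Y.items():
        key = tuple(idx[m] for m in modes_y)
        rest = tuple(i for m, i in enumerate(idx) if m not in modes_y)
        buckets[key].append((rest, val))

    out = defaultdict(float)  # hashtable accumulator for output nonzeros
    for idx, val in X.items():
        key = tuple(idx[m] for m in modes_x)
        rest_x = tuple(i for m, i in enumerate(idx) if m not in modes_x)
        for rest_y, vy in buckets.get(key, ()):
            out[rest_x + rest_y] += val * vy
    return dict(out)

# Z_il = sum_jk X_ijk * Y_kjl, i.e. contract over (j, k)
X = {(0, 1, 2): 3.0, (1, 0, 2): 2.0}
Y = {(2, 1, 0): 4.0, (2, 0, 3): 5.0}
print(contract(X, Y, modes_x=(1, 2), modes_y=(1, 0)))
# -> {(0, 0): 12.0, (1, 3): 10.0}
```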
HPHPC: High Productivity High Performance Computing with Legion and Legate
Abstract: This talk will describe the co-design and implementation of Legion and Legate, two programming systems that combine synergistically to provide a high-productivity, high-performance computing ecosystem. In the first part of the talk, we'll introduce Legion, a task-based runtime system for supercomputers with a strong data model that enables a sophisticated dependence analysis. The second part of the talk will cover Legate, a framework for constructing drop-in replacements for popular Python libraries such as NumPy and Pandas on top of Legion. We'll show how using Legate and Legion together allows users to run unmodified Python programs at scale on hundreds of GPUs simply by changing a few import statements. We'll also discuss how the Legate framework makes it possible to compose such libraries, even in distributed settings.
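In practice, the promised workflow looks like the sketch below. The module name legate.numpy is taken from Legate's early releases (the library was later renamed cuNumeric), and by design, operations Legate has not yet implemented fall back to standard NumPy:

```python
import legate.numpy as np   # the only change: was "import numpy as np"

# The rest is an ordinary NumPy program (here, power iteration); Legion
# partitions the arrays and distributes the work across nodes and GPUs.
A = np.random.rand(10_000, 10_000)
x = np.random.rand(10_000)
for _ in range(100):
    x = A @ x
    x = x / np.sqrt((x * x).sum())   # normalize
print(x[:5])
```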
Performance Engineering for Sparse Matrix-Vector Multiplication: Some new ideas for old problems
Abstract: The sparse matrix-vector multiplication (SpMV) kernel is a key performance component of numerous algorithms in computational science. Despite the kernel's apparent simplicity, the sparse and potentially irregular data access patterns of SpMV and its intrinsically low computational intensity have challenged the development of high-performance implementations for decades. Still, these developments are rarely guided by appropriate performance models.
This talk will address the basic problem of understanding (i.e., modelling) and improving the computational intensity of SpMV kernels, with a focus on symmetric matrices. Using a recursive algebraic coloring (RACE) of the underlying undirected graph, a node-level parallel symmetric SpMV implementation is developed which increases the computational intensity and the performance by up to 2x for a large, general set of matrices. The same idea is then applied to accelerate the computation of sparse matrix powers via cache blocking.
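To see where the up-to-2x gain originates, consider a serial sketch of symmetric SpMV: storing only the upper triangle halves the matrix traffic because each stored nonzero a_ij updates both y_i and y_j, and it is precisely that second, scattered write that makes naive shared-memory parallelization unsafe, the conflict RACE's coloring removes. This sketch is illustrative, not the RACE implementation:

```python
# Symmetric SpMV using only the stored upper triangle (serial sketch).
import numpy as np
from scipy.sparse import random as sprandom, triu

A = sprandom(1000, 1000, density=0.01, format="csr")
A = A + A.T                   # symmetrize
U = triu(A, format="csr")     # keep diagonal and upper triangle only
x = np.random.rand(1000)

y = np.zeros(1000)
for i in range(U.shape[0]):
    for p in range(U.indptr[i], U.indptr[i + 1]):
        j, a = U.indices[p], U.data[p]
        y[i] += a * x[j]
        if i != j:
            y[j] += a * x[i]  # the conflicting scattered write

assert np.allclose(y, A @ x)  # matches SpMV with the full matrix
```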
Gerhard Wellein has more than twenty years of experience in teaching HPC techniques to students and scientists from computational science and engineering, is an external trainer in the Partnership for Advanced Computing in Europe (PRACE), and received the 2011 Informatics Europe Curriculum Best Practices Award (together with Jan Treibig and Georg Hager) for outstanding teaching contributions. His research interests focus on performance modelling and performance engineering, architecture-specific code optimization, novel parallelization approaches, and hardware-efficient building blocks for sparse linear algebra and stencil solvers. He has conducted and led numerous HPC projects, including the German-Japanese project "Equipping Sparse Solvers for Exascale" (ESSEX) within the DFG priority program SPPEXA ("Software for Exascale Computing").
Cloud-Scale Inference on FPGAs at Microsoft Bing
Abstract: Microsoft's Project Catapult began nearly a decade ago, leading to the widespread deployment of FPGAs in Microsoft's data centers for application and network acceleration. Project Brainwave began five years later, applying those FPGAs to accelerate DNN inference for Bing and later other Microsoft cloud services. FPGA flexibility has enabled the Brainwave architecture to evolve rapidly, keeping pace with rapid developments in the DNN model space. The low cost of updating FPGA-based designs also enables greater risk taking, facilitating innovations such as our Microsoft Floating Point (MSFP) data format. FPGAs with hardened support for MSFP will provide a new level of performance for Brainwave. These AI-optimized FPGAs also introduce a new point in the hardware spectrum between general-purpose devices and domain-specific accelerators. Going forward, a key challenge for accelerator architects will be finding the right balance between hardware specialization, hardware configurability, and software programmability.
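MSFP is a block floating-point format: the elements of a small block share a single exponent and keep narrow individual mantissas, which makes dot products mostly integer arithmetic. The toy quantizer below shows the principle only; block size, bit widths, and rounding are illustrative assumptions, not Microsoft's exact format.

```python
# Toy block floating-point quantizer in the spirit of MSFP (illustrative
# parameters, not the real format).
import numpy as np

def bfp_quantize(x, block=16, mantissa_bits=4):
    x = x.reshape(-1, block)
    # One shared exponent per block, chosen from the largest magnitude.
    exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - (mantissa_bits - 1))
    q = np.round(x / scale)   # narrow integer mantissas
    q = np.clip(q, -2 ** (mantissa_bits - 1), 2 ** (mantissa_bits - 1) - 1)
    return (q * scale).reshape(-1)   # dequantized for comparison

v = np.random.randn(64).astype(np.float32)
print("max quantization error:", np.abs(v - bfp_quantize(v)).max())
```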
Inspecting Irregular Computation Patterns to Generate Fast Code
Abstract: Sparse matrix methods are at the heart of many scientific computations and data analytics codes. Sparse matrix kernels often dominate the overall execution time of many simulations. Further, the indirection from indexing and looping over the nonzero elements of a sparse data structure often limits the optimization of such codes. In this talk, I will introduce Sympiler, a domain-specific code generator that transforms computation patterns in sparse matrix methods for high performance. Specifically, I will show how decoupling symbolic analysis from numerical manipulation enables the automatic optimization of sparse codes. I will also demonstrate the application of symbolic analysis in accelerating quadratic program solvers.
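The decoupling is easiest to see on a sparse lower-triangular solve Lx = b with a sparse right-hand side, the classic setting where the set of rows that can become nonzero depends only on sparsity patterns. The hedged sketch below computes that set once (symbolic phase) and reuses it across numeric solves; it illustrates the idea, not Sympiler's generated code:

```python
# Symbolic/numeric decoupling for sparse triangular solve Lx = b.
import numpy as np
from scipy.sparse import csc_matrix

def symbolic_reach(L, b_pattern):
    """Rows that can become nonzero in x: DFS from b's nonzeros over L."""
    reach, stack = set(), list(b_pattern)
    while stack:
        j = stack.pop()
        if j in reach:
            continue
        reach.add(j)
        # Column j of L lists the later rows that x[j] propagates to.
        stack.extend(L.indices[L.indptr[j]:L.indptr[j + 1]])
    return sorted(reach)   # ascending order respects dependencies

def numeric_solve(L, b, reach):
    """Numeric phase: visit only the precomputed reachable rows."""
    x = b.copy()
    for j in reach:
        x[j] /= L[j, j]
        for p in range(L.indptr[j], L.indptr[j + 1]):
            i = L.indices[p]
            if i > j:
                x[i] -= L.data[p] * x[j]
    return x

L = csc_matrix(np.array([[2., 0, 0, 0], [1, 3, 0, 0],
                         [0, 0, 1, 0], [4, 0, 2, 5]]))
b = np.array([4., 0, 0, 0])
reach = symbolic_reach(L, b_pattern=[0])   # pattern-only, computed once
x = numeric_solve(L, b, reach)             # reusable for many b's with
assert np.allclose(L @ x, b)               # the same sparsity pattern
```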
Transferable Deep Learning Surrogates for Solving PDEs
Abstract: Partial differential equations (PDEs) are ubiquitous in science and engineering to model physical phenomena. Notable PDEs are the Laplace and Navier-Stokes equations with numerous applications in fluid dynamics, electrostatics, and steady-state heat transfer. Solving such PDEs relies on numerical methods such as finite element, finite difference, and finite volume. While these methods are extremely powerful, they are also computationally expensive. Despite widespread efforts to improve the performance and scalability of solving these systems of PDEs, several problems remain intractable.
In this talk, we'll explore the potential of deep learning (DL)-based surrogates to both augment and replace numerical simulations. In the first part of the talk, we'll present two frameworks, CFDNet and SURFNet, that couple simulations with a convolutional neural network to accelerate the convergence of the overall scheme without relaxing the convergence constraints of the physics solver. The second part of the talk will introduce another novel framework that leverages DL to build a transferable deep neural network surrogate that solves PDEs in unseen domains with arbitrary boundary conditions. We'll show that a DL model trained only once can be reused without re-training to solve PDEs in large and complex domains with unseen sizes, shapes, and boundary conditions. Compared with state-of-the-art physics-informed neural networks for solving PDEs, we demonstrate 1-3 orders of magnitude speedups while achieving comparable or better accuracy.
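As a reference point for what these surrogates replace, here is a minimal finite-difference baseline, a Jacobi solver for the 2-D Laplace equation; each sweep is cheap, but fine grids need many thousands of sweeps, which is where the cost of classical solvers explodes.

```python
# Finite-difference Jacobi iteration for the 2-D Laplace equation.
import numpy as np

n = 128
u = np.zeros((n, n))
u[0, :] = 1.0   # Dirichlet boundary condition on one edge

for sweep in range(100_000):
    # 5-point stencil: each interior point becomes its neighbors' average.
    new = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
    change = np.abs(new - u[1:-1, 1:-1]).max()
    u[1:-1, 1:-1] = new
    if change < 1e-6:
        print(f"converged after {sweep + 1} sweeps")
        break
```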
Exploring Tools & Techniques for the Frontier Exascale System: Challenges vs Opportunities
Abstract: PIConGPU, an extremely scalable, heterogeneous, fully relativistic particle-in-cell (PIC) C++ code, provides a modern simulation framework for laser-plasma physics and laser-matter interactions suitable for production-quality runs on large-scale systems. This plasma physics application is powered by the alpaka abstraction library and incorporates the openPMD-API, enabling I/O through libraries such as ADIOS2. PIConGPU has run on ORNL's Titan and Summit, and is expected to run on Frontier, the exascale system currently being built. This talk will discuss some of the challenges, opportunities, and potential solutions involved in keeping the code performant and portable while migrating it to Frontier.
High Performance Computing: Beyond Moore's Law
Abstract: Supercomputer performance now exceeds that of the earliest computers by thirteen orders of magnitude, yet science still needs more than they provide. But Dennard scaling and Moore's Law are ending even as AI and HPC demand continued growth. Demand engenders supply, and ways to prolong the growth in supercomputing performance are at hand or on the horizon. Architectural specialization has returned, after a loss of system diversity in the Moore's Law era; it provides a significant boost for computational science. And at the hardware level, the development by Cerebras of a viable wafer-scale compute platform has important ramifications. Other long-term possibilities, notably quantum computing, may eventually play a role.
Why wafer-scale? Real achieved performance in supercomputers (as opposed to the peak speed) is limited by the bandwidth and latency barriers --- memory and communication walls --- that impose delay when off-processor-chip data is needed, and it is needed all the time. By changing the scale of the chip by two orders of magnitude, we can pack a small, powerful, mini-supercomputer on one piece of silicon, and eliminate much of the off-chip traffic for applications that can fit in the available memory. The elimination of most off-chip communication also cuts the power per unit performance, a key parameter when total system power is capped, as it usually is.
Cerebras overcame technical problems concerning yield, packaging, cooling, and delivery of electrical power in order to make wafer-scale computing viable. The Cerebras second-generation wafer has over 800,000 identical processing elements architected with features that support sparsity and power-efficient performance. For ML, algorithmic innovations such as conditional computation and model and data sparsity promise significant savings in memory and computation while preserving model capacity. Flexible hardware, rather than a dense matrix-multiply engine, is required to best exploit these algorithmic innovations. We will discuss the aspects of the architecture that meet that requirement.
Heterogeneous System Architectures: A Strategy to Use Diverse Components
Abstract: Current system architectures rely on a simple approach: one compute node design that is used across the entire system. This approach only supports heterogeneity at the node level. Compute nodes may involve a variety of devices but the system is otherwise homogeneous. This design simplifies scheduling applications and provides consistent expectations for the hardware that a job can exploit but often results in poor utilization of components. The wide range of emerging devices for AI and other domains necessitates a more heterogeneous system architecture that varies the compute node (or volume) types within a single job.
Lawrence Livermore National Laboratory (LLNL) is currently exploring such heterogeneous system architectures. These explorations include the use of novel hardware to accelerate AI models within larger applications and initial software solutions to overcome the challenges posed by heterogeneous system architectures. This talk will present a sampling of the novel software solutions that enable the heterogeneous system architecture as well as the systems that LLNL has currently deployed.
Putting AI on a Diet: TinyML and Efficient Deep Learning
Abstract: Today's AI is too big. Deep neural networks demand extraordinary levels of data and computation, and therefore power, for training and inference. This severely limits the practical deployment of AI on edge devices. We aim to improve the efficiency of neural network design. First, I'll present MCUNet, a framework that brings deep learning to IoT devices by jointly designing the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers that have only 1 MB of Flash. Next, I'll introduce the Once-for-All network, an efficient neural-architecture-search approach that can elastically grow and shrink the model capacity according to the target hardware's resource and latency constraints. Moving from inference to training, I'll present TinyTL, which enables tiny transfer learning on-device, reducing the memory footprint by 7-13x. Finally, I will describe data-efficient GAN training techniques that can generate photo-realistic images using only 100 images, where tens of thousands used to be required. We hope such TinyML techniques can make AI greener, faster, more efficient, and more sustainable.