What: SPCL_Bcast is an open, online seminar series that covers a broad range of topics around parallel and high-performance computing, scalable machine learning, and related areas.

Who: We invite top researchers and engineers from all over the world to speak.

Where: Anyone is welcome to join over Zoom! This link will always redirect to the right Zoom meeting. When possible, we make recordings available on our YouTube channel.

Old talks: See the SPCL_Bcast archive.

Social media: Follow along with #spcl_bcast on Twitter!

When: Every two weeks on Thursdays, in one of two slots (depending on speaker).

14 January – 11 March, 2021:

  • Morning: 9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco
  • Evening: 6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 12 PM (noon) New York, 9 AM San Francisco

25 March, 2021:

  • Morning: 9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 4 AM New York, 1 AM San Francisco
  • Evening: 6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 1 PM New York, 10 AM San Francisco

8 April – 20 May, 2021:

  • Morning: 9 AM Zurich, 4 PM Tokyo, 3 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco
  • Evening: 6 PM Zurich, 1 AM (Friday) Tokyo, 12 AM (midnight) Beijing, 12 PM (noon) New York, 9 AM San Francisco
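The three schedules above differ because the United States and Europe switched to daylight saving time on different dates in 2021, while Tokyo and Beijing do not observe it. As a rough sketch (not part of the original announcement), the local times can be recomputed from the Zurich slot with Python's standard zoneinfo module. The function name local_times and the CITIES table are hypothetical, Asia/Shanghai is used for Beijing, and a system time zone database (or the tzdata package) is assumed to be available.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical mapping from the cities in the schedule to IANA time zone names.
CITIES = {
    "Zurich": "Europe/Zurich",
    "Tokyo": "Asia/Tokyo",
    "Beijing": "Asia/Shanghai",
    "New York": "America/New_York",
    "San Francisco": "America/Los_Angeles",
}


def local_times(year, month, day, zurich_hour):
    """Convert a talk start time given in Zurich local time to every listed city."""
    start = datetime(year, month, day, zurich_hour, tzinfo=ZoneInfo("Europe/Zurich"))
    return {city: start.astimezone(ZoneInfo(tz)) for city, tz in CITIES.items()}


if __name__ == "__main__":
    # Evening slot on three dates, one from each schedule period.
    for month, day in [(1, 14), (3, 25), (4, 8)]:
        times = local_times(2021, month, day, 18)
        line = ", ".join(f"{city} {t:%a %H:%M}" for city, t in times.items())
        print(f"2021-{month:02d}-{day:02d}: {line}")
```

Working from the Zurich time and letting the time zone database apply the offsets avoids hand-maintaining a separate table for each daylight saving period.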

14 January, 2021 — Brian Van Essen (LLNL)
6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 12 PM (noon) New York, 9 AM San Francisco

Enabling Rapid COVID-19 Small Molecule Drug Design Through Scalable Deep Learning of Generative Models

Abstract: We improved the quality and reduced the time to produce machine-learned models for use in small molecule antiviral design. Our globally asynchronous multi-level parallel training approach strong scales to all of Sierra with up to 97.7% efficiency. We trained a novel, character-based Wasserstein autoencoder that produces a higher quality model trained on 1.613 billion compounds in 23 minutes while the previous state-of-the-art takes a day on 1 million compounds. Reducing training time from a day to minutes shifts the model creation bottleneck from computer job turnaround time to human innovation time. Our implementation achieves 318 PFLOPS for 17.1% of half-precision peak. We will incorporate this model into our molecular design loop, enabling the generation of more diverse compounds: searching for novel, candidate antiviral drugs improves and reduces the time to synthesize compounds to be tested in the lab.

Bio: Brian Van Essen is the informatics group leader and a computer scientist at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory (LLNL). He is pursuing research in large-scale deep learning for scientific domains and training deep neural networks using high-performance computing systems. He is the project leader for the Livermore Big Artificial Neural Network open-source deep learning toolkit, and the LLNL lead for the ECP ExaLearn and CANDLE projects. Additionally, he co-leads an effort to map scientific machine learning applications to neural network accelerator co-processors as well as neuromorphic architectures. He joined LLNL in 2010 after earning his Ph.D. and M.S. in computer science and engineering at the University of Washington. He also has an M.S. and B.S. in electrical and computer engineering from Carnegie Mellon University.

28 January, 2021 — Haohuan Fu (Tsinghua University)
9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco

Optimizing CESM-HR on Sunway TaihuLight and An Unprecedented Set of Multi-Century Simulations

Abstract: CESM is one of the very first and most complex scientific codes to be migrated onto Sunway TaihuLight. Being a community code involving hundreds of different dynamic, physics, and chemistry processes, CESM brings severe challenges for the many-core architecture and the parallel scale of Sunway TaihuLight. This talk summarizes our continuous effort on enabling efficient runs of CESM on Sunway, starting from refactoring of CAM in 2015, redesigning of CAM in 2016 and 2017, and a collaborative effort starting in 2018 to enable highly efficient simulations of the high-resolution (25 km atmosphere and 10 km ocean) Community Earth System Model (CESM-HR) on Sunway TaihuLight. The refactoring and optimizing efforts have improved the simulation speed of CESM-HR from 1 SYPD (simulation years per day) to 5 SYPD (with output disabled). Using CESM-HR, we manage to provide an unprecedented set of high-resolution climate simulations, consisting of a 500-year pre-industrial control simulation and a 250-year historical and future climate simulation from 1850 to 2100. Overall, high-resolution simulations show significant improvements in representing global mean temperature changes, seasonal cycle of sea-surface temperature and mixed layer depth, extreme events and in relationships between extreme events and climate modes.

Bio: Haohuan Fu is a professor in the Ministry of Education Key Laboratory for Earth System Modeling, and Department of Earth System Science at Tsinghua University, where he leads the research group of High Performance Geo-Computing (HPGC). He is also the deputy director of the National Supercomputing Center in Wuxi, leading the research and development division. Fu has a PhD in computing from Imperial College London. His research work focuses on providing both the most efficient simulation platforms and the most intelligent data management and analysis platforms for geoscience applications, leading to two consecutive ACM Gordon Bell Prizes (nonhydrostatic atmospheric dynamic solver in 2016, and nonlinear earthquake simulation in 2017).

11 February, 2021 — Jeff Hammond (Intel HPC)
6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 12 PM (noon) New York, 9 AM San Francisco

Evaluating modern programming models using the Parallel Research Kernels

Abstract: The Parallel Research Kernels were developed to support empirical studies of programming models in a variety of contexts without the porting effort required by proxy or mini-applications. I will describe the project and why it has been a useful tool in a variety of contexts and present some of our findings related to modern C++ parallelism for CPU and GPU architectures.

Bio: Jeff Hammond is a Principal Engineer at Intel where he works on a wide range of high-performance computing topics, including parallel programming models, system architecture and open-source software. Previously, Jeff worked at the Argonne Leadership Computing Facility where he worked on Blue Gene and built things with MPI. Jeff received his PhD in Physical Chemistry from the University of Chicago for research performed in collaboration with the NWChem team at Pacific Northwest National Laboratory.

25 February, 2021 — Jiajia Li (PNNL)
6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 12 PM (noon) New York, 9 AM San Francisco

High-Performance Sparse Tensor Operations in HiParTI Library

Abstract: This talk will present the recent development of HiParTI, a Hierarchical Parallel Tensor Infrastructure. I will focus on element-wise sparse tensor contractions, which are common in quantum chemistry, physics, and other fields. We introduce three optimization techniques: multi-dimensional, efficient hash table representations for the accumulator and for the larger input tensor, and all-stage parallelization. Evaluating with 15 datasets, we obtain 28-576x speedup over the traditional sparse tensor contraction. With our proposed algorithm- and memory heterogeneity-aware data management, extra performance improvement is achieved on heterogeneous memory with DRAM and Intel Optane DC Persistent Memory Module (PMM) over state-of-the-art solutions.
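To make the idea of hash-table-based sparse contraction concrete, here is a minimal serial sketch for the order-2 case, Z[i, k] = sum over j of X[i, j] * Y[j, k]. It is an illustration only and does not use or reproduce the HiParTI API. The function name sparse_contract and the dictionary-based input format are assumptions made for this example: one hash table groups the nonzeros of the second input by the contracted index, and a second hash table accumulates the output.

```python
from collections import defaultdict


def sparse_contract(x, y):
    """Contract two sparse order-2 tensors over their shared index j.

    x maps (i, j) -> value and y maps (j, k) -> value; only nonzeros are stored.
    Returns a dict mapping (i, k) -> value.
    """
    # Hash table over the second input, keyed by the contracted index j,
    # so matching nonzeros are found without scanning all of y.
    y_by_j = defaultdict(list)
    for (j, k), y_val in y.items():
        y_by_j[j].append((k, y_val))

    # Hash-table accumulator: output coordinates are created on demand.
    acc = defaultdict(float)
    for (i, j), x_val in x.items():
        for k, y_val in y_by_j.get(j, ()):
            acc[(i, k)] += x_val * y_val
    return dict(acc)


x = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
y = {(0, 0): 4.0, (1, 0): 5.0, (1, 2): 6.0}
z = sparse_contract(x, y)
```

For higher-order tensors, the same pattern applies with longer coordinate tuples as keys. The optimizations described in the talk concern how such tables are laid out and parallelized, which this sketch does not attempt.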

Bio: Jiajia Li is a research scientist in the High Performance Computing group at Pacific Northwest National Laboratory (PNNL). She received her Ph.D. degree from the Georgia Institute of Technology in 2018. Her current research emphasizes optimizing tensor methods, especially for sparse data from diverse applications, by utilizing various parallel architectures. She is a recipient of the Best Student Paper Award at SC'18, a Best Paper Finalist at PPoPP'19, and "A Rising Star in Computational and Data Sciences". She has served on the technical program committee of conferences/journals, such as PPoPP, SC, ICS, IPDPS, ICPP, LCTES, Cluster, ICDCS, TPDS, etc. She previously received a Ph.D. degree from the Institute of Computing Technology at the Chinese Academy of Sciences, China, and a B.S. degree in Computational Mathematics from Dalian University of Technology, China.

11 March, 2021 — Michael Bauer (NVIDIA Research)
9 AM Zurich, 5 PM Tokyo, 4 PM Beijing, 3 AM New York, 12 AM (midnight) San Francisco

HPHPC: High Productivity High Performance Computing with Legion and Legate

Abstract: This talk will describe the co-design and implementation of Legion and Legate, two programming systems that synergistically combine to provide a high productivity high performance computing ecosystem. In the first part of the talk, we'll introduce Legion, a task-based runtime system for supercomputers with a strong data model that enables a sophisticated dependence analysis. The second part of the talk will cover Legate, a framework for constructing drop-in replacements for popular Python libraries such as NumPy and Pandas on top of Legion. We'll show how using Legate and Legion together allows users to run unmodified Python programs at scale on hundreds of GPUs simply by changing a few import statements. We'll also discuss how the Legate framework makes it possible to compose such libraries even in distributed settings.

Bio: Michael Bauer is a principal research scientist at NVIDIA Research where he works on making it easier to program large clusters of GPUs. He is the primary author of the Legion runtime.

25 March, 2021 — Gerhard Wellein (FAU)
6 PM Zurich, 2 AM (Friday) Tokyo, 1 AM (Friday) Beijing, 1 PM New York, 10 AM San Francisco

Performance Engineering for Sparse Matrix-Vector Multiplication: Some new ideas for old problems

Abstract: The sparse matrix-vector multiplication (SpMV) kernel is a key performance component of numerous algorithms in computational science. Despite the kernel's apparent simplicity, the sparse and potentially irregular data access patterns of SpMV and its intrinsically low computational intensity have been challenging the development of high-performance implementations over decades. Still, these developments are rarely guided by appropriate performance models.

This talk will address the basic problem of understanding (i.e., modelling) and improving the computational intensity of SpMV kernels with a focus on symmetric matrices. Using a recursive algebraic coloring (RACE) of the underlying undirected graph, a node-level parallel symmetric SpMV implementation is developed which increases the computational intensity and the performance for a large general set of matrices by a factor of up to 2x. The same idea is then applied to accelerate the computation of sparse matrix powers via cache blocking.
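As background for why symmetric SpMV is attractive and why it is hard to parallelize, the following serial sketch compares a plain CSR kernel with a variant that stores only the upper triangle. It is not RACE and not the speaker's implementation. The function names and the small test matrix are made up for illustration. Each stored off-diagonal entry is used twice, so roughly half the matrix data is read per multiplication, but the mirrored update writes to y[j] from another row, which is the conflict a coloring scheme has to resolve before rows can be processed in parallel.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for a general matrix A stored in CSR format."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def symm_spmv_upper(row_ptr, col_idx, values, x):
    """y = A @ x for symmetric A, given only its upper triangle (with diagonal) in CSR."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            y[i] += values[k] * x[j]
            if j != i:
                # Mirrored entry A[j][i] = A[i][j]: a scattered write to y[j],
                # which would race with row j if rows ran concurrently.
                y[j] += values[k] * x[i]
    return y


# A = [[4, 1, 0],
#      [1, 3, 2],
#      [0, 2, 5]]
full = ([0, 2, 5, 7], [0, 1, 0, 1, 2, 1, 2], [4.0, 1.0, 1.0, 3.0, 2.0, 2.0, 5.0])
upper = ([0, 2, 4, 5], [0, 1, 1, 2, 2], [4.0, 1.0, 3.0, 2.0, 5.0])
x = [1.0, 2.0, 3.0]

y_full = spmv_csr(*full, x)
y_symm = symm_spmv_upper(*upper, x)
```

Both kernels produce the same result, but the symmetric variant touches five stored values instead of seven for this matrix.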

Bio: Gerhard Wellein is a Professor for High Performance Computing at the Department for Computer Science at the University of Erlangen-Nuremberg and holds a PhD in theoretical physics from the University of Bayreuth. He has headed the Erlangen National Center for High Performance Computing since 2001, is the deputy speaker of the Bavarian HPC network KONWIHR, and is a member of the scientific steering committee of the Gauss-Centre for Supercomputing (GCS).

Gerhard Wellein has more than twenty years of experience in teaching HPC techniques to students and scientists from computational science and engineering, is an external trainer in the Partnership for Advanced Computing in Europe (PRACE) and received the "2011 Informatics Europe Curriculum Best Practices Award" (together with Jan Treibig and Georg Hager) for outstanding teaching contributions. His research interests focus on performance modelling and performance engineering, architecture-specific code optimization, novel parallelization approaches and hardware-efficient building blocks for sparse linear algebra and stencil solvers. He has been conducting and leading numerous HPC projects including the German-Japanese project "Equipping Sparse Solvers for Exascale" (ESSEX) within the DFG priority program SPPEXA ("Software for Exascale Computing").

8 April, 2021 — Steve Reinhardt (Microsoft)
6 PM Zurich, 1 AM (Friday) Tokyo, 12 AM (midnight) Beijing, 12 PM (noon) New York, 9 AM San Francisco

Cloud-Scale Inference on FPGAs at Microsoft Bing

Abstract: Microsoft's Project Catapult began nearly a decade ago, leading to the widespread deployment of FPGAs in Microsoft's data centers for application and network acceleration. Project Brainwave began five years later, applying those FPGAs to accelerate DNN inference for Bing and later other Microsoft cloud services. FPGA flexibility has enabled the Brainwave architecture to evolve rapidly, keeping pace with rapid developments in the DNN model space. The low cost of updating FPGA-based designs also enables greater risk taking, facilitating innovations such as our Microsoft Floating Point (MSFP) data format. FPGAs with hardened support for MSFP will provide a new level of performance for Brainwave. These AI-optimized FPGAs also introduce a new point in the hardware spectrum between general-purpose devices and domain-specific accelerators. Going forward, a key challenge for accelerator architects will be finding the right balance between hardware specialization, hardware configurability, and software programmability.

Bio: Steven K. Reinhardt is a Partner Hardware Engineering Manager in the Bing Platform Engineering group. His team leads the development and production deployment of the Brainwave FPGA-based DNN inference accelerator in support of Bing and Office 365. Prior to joining Microsoft, Steve was a Senior Fellow at AMD Research, where he led research on heterogeneous systems and high-performance networking. Before that, he was an Associate Professor in the EECS department at the University of Michigan. Steve has published over 50 refereed conference and journal articles. He was also a primary architect and developer of M5 (now gem5), a widely used open-source full-system architecture simulator. Steve received a Ph.D. in Computer Sciences from the University of Wisconsin-Madison, and is an IEEE Fellow and an ACM Distinguished Scientist.

22 April, 2021 — Maryam Mehri Dehnavi (University of Toronto)
6 PM Zurich, 1 AM (Friday) Tokyo, 12 AM (midnight) Beijing, 12 PM (noon) New York, 9 AM San Francisco — Zoom

Inspecting Irregular Computation Patterns to Generate Fast Code

Abstract: Sparse matrix methods are at the heart of many scientific computations and data analytics codes. Sparse matrix kernels often dominate the overall execution time of many simulations. Further, the indirection from indexing and looping over the nonzero elements of a sparse data structure often limits the optimization of such codes. In this talk, I will introduce Sympiler, a domain-specific code generator that transforms computation patterns in sparse matrix methods for high performance. Specifically, I will show how decoupling symbolic analysis from numerical manipulation will enable the automatic optimization of sparse codes. I will also demonstrate the application of symbolic analysis in accelerating quadratic program solvers.

Bio: Maryam Mehri Dehnavi is an Assistant Professor in the Computer Science department at the University of Toronto and is the Canada Research Chair in parallel and distributed computing. Her research focuses on high-performance computing and domain-specific compiler design. Previously, she was an Assistant Professor at Rutgers University and a postdoctoral researcher at MIT. She received her Ph.D. from McGill University in 2013. Some of her recognitions include the Canada Research Chair award, the Ontario Early Researcher award, and the ACM SRC grand finale prize.

6 May, 2021 — Aparna Chandramowlishwaran (UC Irvine)
6 PM Zurich, 1 AM (Friday) Tokyo, 12 AM (midnight) Beijing, 12 PM (noon) New York, 9 AM San Francisco — Zoom

Details to be announced.

20 May, 2021 — Sunita Chandrasekaran (University of Delaware)
6 PM Zurich, 1 AM (Friday) Tokyo, 12 AM (midnight) Beijing, 12 PM (noon) New York, 9 AM San Francisco — Zoom

Details to be announced.