Theodoros Rekatsinas
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *Theodoros Rekatsinas* of *Axelera AI* presenting on *Data
Selection - Data Challenges when Training Generative Models*. Everyone
is welcome to attend (over Zoom)!
*When:* Thursday, 8th May, 9AM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* This talk explores how strategic data selection can improve
the efficiency of training generative AI models. I will cover approaches
for both pre-training and fine-tuning that achieve comparable
performance to full training while using only a fraction of the data.
I will then discuss key filtering techniques and data selection
methods for efficient pre-training, as well as the connection between
data selection and optimal transport for optimized fine-tuning. I will
conclude with promising future directions for adaptive data selection
research.
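To give a flavor of the kind of filtering the abstract alludes to (this is a minimal sketch, not the speaker's actual method; the scoring function and the kept fraction are placeholder assumptions), a score-and-select loop might look like this:

```python
import numpy as np

def select_top_fraction(examples, score_fn, fraction=0.1):
    """Keep only the highest-scoring fraction of a dataset.
    `score_fn` is any per-example quality or relevance score
    (e.g. a classifier probability); it and the default fraction
    are purely illustrative placeholders."""
    scores = np.array([score_fn(ex) for ex in examples])
    k = max(1, int(len(examples) * fraction))
    keep = np.argsort(-scores)[:k]   # indices of the k best-scoring examples
    return [examples[i] for i in keep]

# Toy usage: score by length as a stand-in for a real quality model.
corpus = ["short", "a much longer and more informative document", "mid length text"]
print(select_top_fraction(corpus, score_fn=len, fraction=0.5))
```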
*Biography:* Theo Rekatsinas is the VP of Machine Learning at Axelera
AI. Before that, he was a tech lead at Apple working on on-device
intelligence and a senior manager in the Apple Knowledge Graph (KG) team,
responsible for the KG construction and Graph Machine Learning teams.
Theo co-founded Inductiv (acquired by Apple), a company that developed
Generative AI solutions for identifying and correcting errors in data.
Theo was also a Professor of Computer Science at ETH Zürich and the
University of Wisconsin-Madison. Theo's research focuses on scalable
machine learning over billion-scale relational and graph-structured
data, exploring the fundamental connections of data preparation, data
integration, and knowledge management with statistical machine learning
and probabilistic inference.
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>
John Mellor-Crummey
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *John Mellor-Crummey* of *Rice University* presenting on
*Measurement and Analysis of Application Performance on GPU-accelerated
Systems at Exascale*. Everyone is welcome to attend (over Zoom)!
*When:* Thursday, 13th March, 6PM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* As part of the US DOE's Exascale Computing Project, Rice
University began extending its HPCToolkit performance tools to support
instruction-level measurement and analysis of applications executing on
GPU-accelerated exascale supercomputers. Hardware support for
instruction-level performance measurement in AMD, Intel, and NVIDIA GPUs
was developed at the urging of the HPCToolkit project team. HPCToolkit
employs PC sampling or binary instrumentation to perform
instruction-level measurements of GPU computations. When measuring a
GPU-accelerated application, HPCToolkit employs a novel wait-free data
structure to communicate performance measurements between tool threads
and application threads. To help attribute performance information in
detail, HPCToolkit performs parallel analysis of large CPU and GPU
binaries involved in the execution of an exascale application to rapidly
recover mappings between machine instructions and source code. To
analyze terabytes of performance measurements gathered during executions
at exascale, HPCToolkit employs distributed-memory parallelism,
multithreading, sparse data structures, and out-of-core streaming
analysis algorithms. To support interactive exploration of profiles up
to terabytes in size, HPCToolkit's hpcviewer GUI uses out-of-core
methods to visualize performance data. These strategies have enabled
HPCToolkit to efficiently measure, analyze and explore terabytes of
performance data for executions using as many as 64K MPI ranks and 64K
GPU tiles on ORNL's Frontier supercomputer. This talk will describe key
aspects of HPCToolkit, successes analyzing applications, and some
challenges ahead.
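As background for the out-of-core, sparse streaming analysis mentioned above (a toy Python sketch of the general idea only, not HPCToolkit's actual code; the record format and chunk size are invented for illustration):

```python
from collections import defaultdict

def stream_profile(records, chunk_size=1_000_000):
    """Toy out-of-core-style aggregation: fold measurement records into a
    sparse (calling-context, metric) -> value map one chunk at a time, so
    memory is bounded by the aggregate, not by the raw measurement stream."""
    totals = defaultdict(float)        # sparse: only touched cells exist
    chunk = []
    for rec in records:                # `records` could be read lazily from disk
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            _fold(chunk, totals)
            chunk.clear()
    _fold(chunk, totals)               # flush the last partial chunk
    return totals

def _fold(chunk, totals):
    for ctx, metric, value in chunk:
        totals[(ctx, metric)] += value

# Example with three fabricated samples attributed to two calling contexts.
samples = [("main>solve", "gpu_cycles", 120.0),
           ("main>solve", "gpu_cycles", 80.0),
           ("main>io", "bytes_moved", 4096.0)]
print(dict(stream_profile(samples, chunk_size=2)))
```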
*Biography:* John Mellor-Crummey is a Professor of Computer Science at
Rice University in Houston, TX, USA. His principal research focus at
present is tools for measurement and analysis of application
performance. His past work includes scalable synchronization algorithms
for shared-memory multiprocessors, compilers and runtime systems for
parallel computing, techniques for execution replay of parallel
programs, tools for dynamic data race detection, and techniques for
network performance analysis and optimization. Mellor-Crummey co-led
development of the OMPT tools interface for OpenMP 5. He is a
co-recipient of the 2006 Dijkstra Prize in Distributed Computing,
co-recipient of a 2024 Honor Award from the US Secretary of Energy, and
a Fellow of the ACM.
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>
Jesper Larsson Träff
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *Jesper Larsson Träff* of *TU Wien (Vienna University of
Technology)* presenting on *Broadcast, Reduction and beyond with Block
Schedules and Circulant Graphs*. Everyone is welcome to attend (over Zoom)!
*When:* Thursday, 12th December, 10AM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* We present a round-optimal algorithm for broadcasting n
indivisible blocks of data over p processors communicating in a regular,
logarithmic degree circulant graph pattern. This broadcast algorithm
immediately leads to partly new, likewise round-optimal algorithms for
the reduction to root, all-to-all broadcast (allgatherv) and irregular
and regular reduce-scatter operations. The broadcast algorithm relies on
block schedules with certain properties which we indicate can be
computed optimally in O(log p) operations per processor without
communication. The communication pattern and algorithms are attractive
for implementing most of the standard, dense collective operations of MPI.
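To make the setting concrete (background illustration only, not the algorithm from the talk): one common logarithmic-degree circulant pattern connects processor i to the processors at distances ±2^k mod p, and the classic round lower bound for broadcasting n blocks in a one-ported model is ceil(log2 p) + n - 1, presumably the bound that "round-optimal" matches here. Both assumptions are sketched below.

```python
import math

def circulant_neighbors(i, p):
    """Neighbors of processor i in a logarithmic-degree circulant graph with
    skips +/- 2^k (mod p) -- one common choice of such a pattern, assumed
    here purely for illustration."""
    skips = [1 << k for k in range(max(1, math.ceil(math.log2(p))))]
    dists = {(i + s) % p for s in skips} | {(i - s) % p for s in skips}
    return sorted(dists)

def broadcast_round_lower_bound(n, p):
    """Classic lower bound on communication rounds for broadcasting n
    indivisible blocks over p processors in a one-ported model."""
    return math.ceil(math.log2(p)) + n - 1

print(circulant_neighbors(0, 8))           # [1, 2, 4, 6, 7]
print(broadcast_round_lower_bound(10, 8))  # ceil(log2 8) + 10 - 1 = 12
```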
*Biography:* Jesper Larsson Träff has been professor of Parallel Computing
at TU Wien (Vienna University of Technology) since 2011. From 2010 to 2011
he was guest professor for Scientific Computing at the University of
Vienna. From 1998 until 2010 he worked at NEC Laboratories
Europe in Sankt Augustin, Germany, on efficient implementations of MPI
for NEC vector supercomputers; this work led to a doctorate (Dr.
Scient.; Habilitation) from the University of Copenhagen in 2009. From
1995 to 1998 he spent four years as PostDoc/Research Associate in the
Algorithms Group of the Max-Planck Institute for Computer Science in
Saarbrücken, and the Efficient Algorithms Group at the Technical
University of Munich. He received an M.Sc. in computer science in 1989,
and, after two interim years at the industrial research center ECRC in
Munich, a Ph.D. in 1995, both from the University of Copenhagen.
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>
Mark Silberstein
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *Mark Silberstein* of *Technion* presenting on *The evolution of
accelerator-centric GPU services - past, present, future*. Everyone is
welcome to attend (over Zoom)!
*When:* Thursday, 28th November, 6PM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* GPUs have come a long way, evolving from gaming processors
to the main driving force behind modern AI systems. However, from a
system design perspective, they remain co-processors: they cannot
operate independently of the host CPU, which is necessary to invoke
kernels, manage GPU memory, perform data transfers, and interact with
I/O devices. Thus, beyond the complexity of optimizing individual
kernels, GPU-accelerated application development faces fundamental
challenges in integrating GPU computations into complex data and control
flows involving networking and storage. Since 2013, my students in the
Accelerated Computing Systems Group (https://acsl.group) have been
exploring an alternative, accelerator-centric system design in which a
GPU runs specially crafted OS layers that allow GPU kernels to access
files, storage devices, SmartNICs, and network services, without CPU
involvement in the data and/or control path. We have demonstrated how
such an approach eases the programming burden while achieving high
performance. In this talk, I will survey the key ideas of the
accelerator-centric design, discuss the main takeaways, and explore
future trends.
*Biography:* Mark Silberstein is a professor in the Electrical and
Computer Engineering Department at the Technion - Israel Institute of
Technology. His research interests span a broad range of topics in
computer systems, including OS, networking, computer architecture, and
systems security. His projects have been published in top systems
venues, with some winning awards and others being adopted by the
industry. He regularly serves on the program committees of leading
systems conferences, including as a program co-chair of Eurosys '24 and
ASPLOS '26.
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>
Oskar Mencer
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *Oskar Mencer* of *Groq* presenting on *Programming Groq LPUs
without IEEE Floating Point*. Everyone is welcome to attend (over Zoom)!
*When:* Thursday, 2nd May, 6PM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* The IEEE floating-point standard was a great advance in the
early days of software. In those early days, the speed of software
development was imperative, and the Intel x86 instruction set became a
standard, as did IEEE floating point. Today, we have the first commodity
computing application, the LLM, and others are rapidly following. In the
commodity economy, efficiency and cost become the utmost imperative. As
we give up on the x86 instruction set, we also have to consider custom
number representations for each variable in our programs, opening the
world of Physics and Computer Science to a new dimension in computing
(as predicted in my talk at ETH in 2000). In this talk I will cover how
to find the (locally) optimal range and precision for each variable, and
how to optimally utilize custom precision arithmetic units in modern
leading compute chips such as the Groq LPU.
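As a rough illustration of what a per-variable range and precision analysis might compute (a simplistic stand-in, not the speaker's method; the error criterion and fixed-point format are assumptions made for this sketch):

```python
import math

def fixed_point_format(values, abs_error=1e-3):
    """Pick a signed fixed-point format wide enough for the observed range
    of `values` with quantization step <= abs_error. The range/precision
    criterion here is an illustrative choice, not the method from the talk."""
    magnitude = max(abs(v) for v in values)
    # Fractional bits: smallest f with 2^-f <= abs_error.
    frac_bits = max(0, math.ceil(-math.log2(abs_error)))
    # Integer bits: enough for the magnitude, plus a separate sign bit.
    int_bits = max(1, math.ceil(math.log2(magnitude + 1))) if magnitude > 0 else 1
    return 1 + int_bits + frac_bits, int_bits, frac_bits

total, i, f = fixed_point_format([-3.2, 0.05, 7.9], abs_error=1e-3)
print(f"Q{i}.{f} plus sign bit -> {total} bits per value")
```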
*Biography:* Oskar Mencer received a PhD in Computer Engineering from
Stanford University in 2000, interviewed unsuccessfully at ETH for an
Assistant Professor position, joined Bell Labs 1127, then became an EPSRC
Advanced Fellow at Imperial, founded Maxeler Technologies, and later
received major investments from, among others, JP Morgan and CME Group.
Maxeler was recently acquired by Groq, the leading AI inference company
in California. Oskar remains CEO of Maxeler, a Groq Company, and now
lives on Palm Jumeirah in Dubai.
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>
Petar Veličković
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *Petar Veličković* of *DeepMind and the University of Cambridge*
presenting on *Capturing Computation with Algorithmic Alignment*.
Everyone is welcome to attend (over Zoom)!
*When:* Thursday, 21st March, 6PM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* What makes a neural network better, or worse, at fitting
certain tasks? This question is arguably at the heart of neural network
architecture design, and it is remarkably hard to answer rigorously.
Over the past few years, there has been a plethora of attempts, using
various facets of advanced mathematics, to answer this question under
various assumptions. One of the most successful directions --
algorithmic alignment -- assumes that the target function, and a
mechanism for computing it, are completely well-defined and known (i.e.
the target is to learn to execute an algorithm). In this setting,
fitting a task is equated to capturing the computations of an algorithm,
inviting analyses from diverse branches of mathematics and computer
science. I will present some of my personal favourite works in
algorithmic alignment, along with their implications for building
intelligent systems of the future.
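A classic example from the algorithmic alignment literature (offered here as an illustration of "capturing the computations of an algorithm", not necessarily the exact material of the talk) is that one round of min-aggregation message passing coincides with one Bellman-Ford relaxation round. A minimal sketch:

```python
def bellman_ford_round(dist, edges):
    """One relaxation round: d[v] <- min(d[v], min over edges (u,v,w) of d[u] + w)."""
    new = dict(dist)
    for u, v, w in edges:
        new[v] = min(new[v], dist[u] + w)
    return new

def min_aggregation_round(dist, edges, message, update):
    """The same step phrased as message passing: each edge sends a message,
    each node aggregates with min and updates its state. A network whose
    learned `message`/`update` mimic addition and min 'aligns' with Bellman-Ford."""
    incoming = {v: [] for v in dist}
    for u, v, w in edges:
        incoming[v].append(message(dist[u], w))
    return {v: update(dist[v], min(msgs, default=float("inf")))
            for v, msgs in incoming.items()}

edges = [("s", "a", 2.0), ("a", "b", 1.0), ("s", "b", 5.0)]
dist = {"s": 0.0, "a": float("inf"), "b": float("inf")}
one = bellman_ford_round(dist, edges)
two = min_aggregation_round(dist, edges,
                            message=lambda d, w: d + w,
                            update=lambda old, agg: min(old, agg))
print(one == two)  # True: the two formulations coincide on this step
```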
*Biography:* Petar is a Staff Research Scientist at Google DeepMind, an
Affiliated Lecturer at the University of Cambridge, and an Associate of
Clare Hall, Cambridge. He holds a PhD in Computer Science from the
University of Cambridge (Trinity College), obtained under the
supervision of Pietro Liò. His research concerns geometric deep
learning—devising neural network architectures that respect the
invariances and symmetries in data (a topic he’s co-written a proto-book
about). For his contributions, he is recognized as an ELLIS Scholar in
the Geometric Deep Learning Program. In particular, he focuses on graph
representation learning and its applications in algorithmic reasoning
(featured in VentureBeat). He is the first author of Graph Attention
Networks—a popular convolutional layer for graphs—and Deep Graph
Infomax—a popular self-supervised learning pipeline for graphs (featured
in ZDNet). His research has been used in substantially improving
travel-time predictions in Google Maps (featured in CNBC, Engadget,
VentureBeat, CNET, The Verge, and ZDNet), and guiding the intuition of
mathematicians towards new top-tier theorems and conjectures (featured
in Nature, Science, Quanta Magazine, New Scientist, The Independent, Sky
News, The Sunday Times, la Repubblica, and The Conversation).
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>
Albert Cohen
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *Albert Cohen of Google* presenting on *Can I Cook a 5 o'clock
Compiler Cake and Eat It at 2?* Everyone is welcome to attend (over Zoom)!
*When:* Thursday, 7th December, 9AM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* In high-performance computing words: can we build a compiler
that will eventually save a lot of performance engineering effort while
immediately delivering competitive results? Here, competitiveness refers
to achieving near hardware peak-performance for important applications.
The question is particularly hot in a domain-specific setting, where the
building blocks for constructing an effective optimizing compiler may be
inadequate, too generic, or too low-level. It is widely understood that
compiler construction has failed to deliver early afternoon sweets. I
personally feel bad about it, but until recently it remained an academic
exercise to challenge the status quo. Maybe it is now time to reconsider
this assumption: ML-enhanced compilers are becoming the norm rather than the
exception. New compiler frameworks reconcile optimizations for the
common case with application-specific performance. Domain-specific code
generators play an essential role in the implementation of dense and
sparse numerical libraries. But even with the help of domain-specific
compilers, peak performance can only be achieved at the expense of a
dramatic loss of programmability. Are we ever going to find a way out of
this programmability/performance dilemma? What about the velocity and
agility of compiler engineers? Can we make ML-based heuristics scalable
enough to compile billions of lines of code? Can we do so while enabling
massive code reuse across domains, languages and hardware? We will
review these questions, based on recent successes and half-successes in
academia and industry. We will also extend an invitation to tackle these
challenges in future research and software development.
*Biography:* Albert Cohen is a research scientist at Google. An alumnus
of École Normale Supérieure de Lyon and the University of Versailles, he
has been a research scientist at Inria, a visiting scholar at the
University of Illinois, an invited professor at Philips Research, and a
visiting scientist at Facebook Artificial Intelligence Research. Albert
works on parallelizing, optimizing and machine learning compilers, and
on dataflow and synchronous programming languages, with applications to
high-performance computing, artificial intelligence and reactive control.
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>
Marian Verhelst
The Scalable Parallel Computing Lab's *SPCL_Bcast* seminar continues
with *Marian Verhelst of KU Leuven* presenting on *Heterogeneous
multi-core systems for efficient EdgeML*. Everyone is welcome to attend
(over Zoom)!
*When:* Thursday, 26th October, 9AM CET
*Where:* Zoom
Join <https://spcl.inf.ethz.ch/Bcast/join>
*Abstract:* Embedded ML applications are characterized by increasingly
diverse workloads, forming a rich mixture of signal processing, GeMM and
conv kernels, attention layers, and even graph processing. Accelerator
efficiency suffers from supporting this wide variety of kernels.
Heterogeneous multi-core systems can offer a solution but come with their
own challenges, such as: (1) how to find the optimal combination of
cores; (2) how to efficiently map workloads across cores; and (3) how to
share data between these cores. This talk will report on a heterogeneous
multi-core system for embedded neural network processing taped out at
KU Leuven MICAS. Moreover, it will give an outlook on work in progress
towards expanding this system to cover more workloads and more
heterogeneous cores.
*Biography:* Marian Verhelst is a full professor at the MICAS laboratories
of KU Leuven and a research director at imec. Her research focuses on
embedded machine learning, hardware accelerators, HW-algorithm co-design
and low-power edge processing. She received a PhD from KU Leuven in
2008, and worked as a research scientist at Intel Labs, Hillsboro, OR,
from 2008 until 2010. Marian is a member of the board of directors of
tinyML, is active in the TPCs of DATE, ISSCC, VLSI and ESSCIRC, and was
the chair of tinyML 2021 and TPC co-chair of AICAS 2020. Marian is an IEEE
SSCS Distinguished Lecturer, was a member of the Young Academy of
Belgium, an associate editor for TVLSI, TCAS-II and JSSC and a member of
the STEM advisory committee to the Flemish Government. Marian received
the laureate prize of the Royal Academy of Belgium in 2016, the 2021
Intel Outstanding Researcher Award, the André Mischke YAE Prize for
Science and Policy in 2021, and two ERC grants.
More details & future talks <https://spcl.inf.ethz.ch/Bcast/>
Scalable Parallel Computing Lab (SPCL)
Department of Computer Science, ETH Zurich
Website <https://spcl.inf.ethz.ch> X(Twitter)
<https://twitter.com/spcl_eth> YouTube <https://www.youtube.com/@spcl>
GitHub <https://github.com/spcl>