Productive parallel programming for FPGA with HLS

Title: Productive Parallel Programming for FPGA with High-Level Synthesis
Speakers: Johannes de Fine Licht and Torsten Hoefler
Venue: The International Conference for High Performance Computing, Networking, Storage, and Analysis 2019 [SC'19]
Time: 1:30 PM — 5:00 PM, Sunday, November 17th, 2019
Location: 407, Colorado Convention Center, Denver, CO

Slides: Part 0 [Introduction], Part 1 [Technical]
Example code: available on GitHub
HLS extensions: available on GitHub

A virtual machine containing the tools necessary to synthesize the examples for the tutorial is available here (46 GB download, 96 GB decompressed).

Key techniques used in this tutorial are included in our paper, Transformations of High-Level Synthesis Codes for High-Performance Computing [1], available on arXiv.

This tutorial was previously given at SC'18 and PPoPP'18. This iteration will feature an updated slide deck and improved example codes for both the Xilinx and Intel ecosystems.

Abstract: Energy efficiency has become a first-class citizen in the design of large computing systems. While GPUs and custom processors have shown merit in this regard, reconfigurable architectures, such as FPGAs, promise another major step in energy efficiency, constituting a middle ground between fixed hardware architectures and custom-built ASICs. Programming FPGAs has traditionally been done in hardware description languages, requiring extensive hardware knowledge and significant engineering effort. This tutorial shows how high-level synthesis (HLS) can be harnessed to productively achieve scalable pipeline parallelism on FPGAs. Attendees will learn how to target FPGA resources from high-level C++ or OpenCL code, guiding the mapping from imperative code to hardware, enabling them to develop massively parallel designs with real performance benefits. We treat well-known examples from the software world, relating traditional code optimizations to both corresponding and new transformations for hardware, building on existing knowledge when introducing new topics. By bridging the gap between software and hardware optimization, our tutorial aims to enable developers from a larger set of backgrounds to start tapping into the potential of FPGAs with real high-performance codes.


We cover modeling, designing and implementing FPGA hardware using modern HLS tools. We include aspects of performance modeling, but the majority of the tutorial will focus on practice, with the goal of enabling attendees to start writing parallel hardware of their own.

After an introduction to dataflow architectures and an overview of existing technology in the domain of reconfigurable architectures, we move to the practical section by introducing the HLS abstraction. We use frequent code examples to introduce central properties of the mapping from imperative languages to hardware, and the performance aspects implied by this transformation. Examples will be demonstrated with small- to medium-length live coding sessions interleaved with the presentation, such that attendees are faced with real coding at a pace that is easy to follow. All examples shown are available to attendees, along with a virtual machine containing the relevant HLS tool, allowing them to follow along and do further experiments before, during, and/or after the tutorial.

All concepts demonstrated will be put in the context of HPC by highlighting their performance impact, constituting an approach to HLS that is built around the main strength of the FPGA architecture: pipeline parallelism. We show how to resolve common complications that arise when designing pipelines, such as loop-carried dependencies and memory interface contention, and how pipelined designs can be scaled up to exploit the full performance potential of the FPGA. After the break, we apply the concepts presented to an HPC application, demonstrating how an HLS design can be scaled up to fully utilize a modern FPGA accelerator.
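To give a flavor of what a loop-carried dependency looks like in HLS code, here is a minimal sketch that is not taken from the tutorial material. It contrasts a naive floating-point accumulation with an interleaved variant that keeps several independent partial sums. The function names and the value of kAddLatency are hypothetical, and the pragmas follow Xilinx Vivado HLS syntax as an assumption about the target tool; a standard C++ compiler ignores them, so the code also runs as plain software.

```cpp
#include <cassert>
#include <vector>

// Assumed latency (in cycles) of one floating-point addition on the target.
// The real value depends on the device, the tool, and the clock target.
constexpr int kAddLatency = 8;

// Naive accumulation: each iteration reads the value of `sum` written by the
// previous iteration. If the addition takes several cycles, a pipelined loop
// can only start a new iteration once the previous addition has finished.
float AccumulateNaive(const float* in, int n) {
  assert(n >= 0);
  float sum = 0;
  for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
    sum += in[i];
  }
  return sum;
}

// Interleaved accumulation: consecutive iterations update different partial
// sums, so a given partial sum is only revisited every kAddLatency iterations.
// A real tool may need additional hints (for example a dependence pragma or a
// nested-loop formulation) before it exploits this.
float AccumulateInterleaved(const float* in, int n) {
  assert(n >= 0);
  float partial[kAddLatency];
  #pragma HLS ARRAY_PARTITION variable=partial complete
  for (int j = 0; j < kAddLatency; ++j) {
    #pragma HLS UNROLL
    partial[j] = 0;
  }
  for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
    partial[i % kAddLatency] += in[i];
  }
  // Final reduction over the small, fixed number of partial sums.
  float sum = 0;
  for (int j = 0; j < kAddLatency; ++j) {
    sum += partial[j];
  }
  return sum;
}
```

Both functions compute the same sum up to floating-point reordering. The interleaved version trades a few extra registers and a short final reduction for the chance to accept one new input per cycle.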

We close the tutorial with a discussion of the limitations of HLS (that are not usually advertised by vendors), and an overview of the current vendor, hardware, and toolset landscape.

Nallatech 520N (Intel Stratix 10).

Xilinx VCU1525 (Ultrascale+ VU9P).


The presentation is interleaved with live demos, exposing attendees to real code, synthesized live with an HLS tool by the presenter.

We provide a virtual machine with the necessary tools pre-installed. Alternatively, the tools can be downloaded and installed from the Xilinx and Intel websites (requires registering free accounts with Xilinx and Intel, respectively).

Prerequisite knowledge

This tutorial is aimed at audiences coming from an HPC background, interested in programming FPGAs for massive spatial parallelism. To benefit from the tutorial, the following prior knowledge is suggested:
  • Proficiency in C++
  • Basics of FPGA architecture
  • Performance optimization techniques on CPU/GPU (tiling, unrolling, vectorization)


[1] J. de Fine Licht, M. Besta, S. Meierhans, T. Hoefler: Transformations of High-Level Synthesis Codes for High-Performance Computing. CoRR, vol. abs/1805.08288, May 2018.