Productive parallel programming for FPGA with HLS

Title: Productive parallel programming for FPGA with high-level synthesis
Speakers: Johannes de Fine Licht and Torsten Hoefler
Venue: The International Conference for High Performance Computing, Networking, Storage, and Analysis 2018 [SC'18]
Time: 13:30 - 17:00, Sunday November 11th, 2018
Room: C144

Slides: Part 0 [Introduction], Part 1 [Technical]
Example code: available on GitHub
HLS extensions: available on GitHub

A virtual machine containing the tools necessary to synthesize the examples for the tutorial is available here (60 GB download).

Key techniques used in this tutorial are included in our paper, Transformations of High-Level Synthesis Codes for High-Performance Computing [1], available on arXiv.

Previous iterations of this tutorial were given at PPoPP'18 and at ETH Zurich. The tutorial at SC'18 will include additional improvements based on feedback from those sessions.

Abstract: As the scale of large high-performance computing systems increases, so does their power consumption, making energy efficiency a first-class concern in their design. While GPUs and custom processors have improved this situation significantly, reconfigurable architectures, such as FPGAs, promise another major step in energy efficiency, constituting a middle ground between fixed hardware architectures and custom-built ASICs. Programming FPGAs has traditionally been done in hardware description languages, requiring extensive hardware knowledge and significant engineering effort. This tutorial shows how high-level synthesis (HLS) can be harnessed to productively achieve scalable pipeline parallelism on FPGAs. Attendees will learn how to target FPGA resources from high-level C++ or OpenCL code, guiding the mapping from imperative code to hardware, enabling them to develop massively parallel designs with real performance benefits. We treat concrete examples well known from the software world, relating traditional code optimizations to both corresponding and new transformations for hardware, building on existing knowledge when introducing new topics. By bridging the gap between software and hardware optimization, our tutorial aims to enable developers from a larger set of backgrounds to start tapping into the potential of FPGAs with real high-performance codes.


This tutorial will cover modeling, designing and implementing FPGA hardware using a modern HLS tool. We include aspects of performance modeling, but the majority of the tutorial will focus on practice, with the goal of enabling attendees to start writing parallel hardware of their own. After an introduction to dataflow architectures and an overview of existing technology in the domain of reconfigurable architectures, the programming part starts from the point of view of hardware description languages, then zooms out to the level of abstraction offered by HLS languages.

We use code examples to introduce central properties of the mapping from imperative languages to hardware, and the performance aspects implied by this transformation. Examples will be demonstrated in short live coding sessions interleaved with the presentation, such that attendees are faced with real coding at a pace that is easy to follow. All examples shown will be made available to attendees, as well as a virtual machine containing the relevant HLS tool, allowing them to follow along and do further experiments before, during, and after the tutorial.

All concepts demonstrated will be put in the context of HPC by highlighting their performance impact, constituting an approach to HLS that is built around the main strength of the FPGA architecture: pipeline parallelism. We show how to resolve common complications that arise when designing pipelines, such as loop-carried dependencies and memory interface contention, and how pipelined designs can be scaled up to exploit the full performance potential of the FPGA. Towards the end of the tutorial we reiterate the content by applying the concepts presented to an HPC application, demonstrating how the design can be scaled to near-peak performance on any modern FPGA accelerator.

Finally, we discuss the limitations of HLS (which vendors do not usually advertise), when lower-level programming is relevant, and the differences between the OpenCL and C++ variants of HLS, and give tips on how to approach new problems.

Hardware

  • Nallatech 520N (Intel Stratix 10)
  • Xilinx VCU1525 (Ultrascale+ VU9P)


The presentation is interleaved with live demos, exposing attendees to real code synthesized live with an HLS tool. The source code, makefiles, and a virtual machine with the HLS tool will be made available to attendees. However, due to limitations imposed by long kernel build times, licensing, and access to FPGA boards, we will not ask attendees to execute kernels in hardware during the tutorial.

The examples presented throughout the tutorial can be cloned from GitHub:

We provide a virtual machine with the necessary tools pre-installed. Alternatively, the tools can be downloaded and installed from the Xilinx and Intel websites (requires registering free accounts with Xilinx and Intel, respectively).

Prerequisite knowledge

This tutorial is aimed at audiences coming from an HPC background, interested in programming FPGAs for massive spatial parallelism. To benefit from the tutorial, the following prior knowledge is suggested:
  • Proficiency in C++
  • Basics of FPGA architecture
  • Performance optimization techniques on CPU/GPU (tiling, unrolling, vectorization)


[1] J. de Fine Licht, M. Besta, S. Meierhans, T. Hoefler: Transformations of High-Level Synthesis Codes for High-Performance Computing. CoRR, vol. abs/1805.08288, May 2018.