Productive parallel programming for FPGA with HLS

Title: Productive parallel programming for FPGA with high-level synthesis
Venue: Principles and Practice of Parallel Programming 2018 [PPoPP'18]
Speakers: Johannes de Fine Licht and Torsten Hoefler
Time: 08:30 - 12:00, Sunday February 25, 2018

Slides: available as PDF

Example code: available on GitHub (or as a tarball snapshot)

HLS extensions: available on GitHub


Johannes also presented a poster, "Designing scalable FPGA architectures using high-level synthesis" [1], at the PPoPP/CGO/HPCA'18 poster session (link to PDF in reference).



Abstract: As the scale of high performance computing systems increases, so does their power consumption, making energy efficiency an increasingly important consideration in their design. While GPUs and custom processors have improved this situation significantly, FPGAs promise another major step in energy efficiency, representing a middle ground between fixed hardware architectures and custom built ASICs. Programming FPGAs has traditionally been done in hardware description languages, requiring extensive hardware knowledge and significant engineering effort. This tutorial shows how high-level synthesis (HLS) can be harnessed to efficiently exploit spatial parallelism on FPGAs, while preserving programmer productivity. Attendees will learn how to target available FPGA resources with high-level C++ constructs, and control and guide the mapping from imperative code to hardware, enabling them to develop massively parallel designs by identifying and implementing patterns suitable for spatial parallelism. We will establish the central concepts of HLS necessary to achieve an efficient hardware implementation, then show how performance modeling and more advanced programming techniques can be used to optimize it further. By enabling the design of efficient FPGA implementations in a high level language, our tutorial seeks to bridge the gap between software and hardware development, allowing programmers from a larger set of backgrounds to begin tapping into the potential of FPGAs.




We must pipeline both local loop schedules and the global dataflow for maximum throughput.



Content


FPGAs in the context of HPC are in a phase similar to where GPUs were 10 years ago: their potential is recognized, but the HPC community has limited practical experience and few success stories. Just as CUDA facilitated the widespread adoption of GPUs, various high-level synthesis (HLS) tools are contenders to enable a similar breakthrough for FPGAs. Although they promise high productivity, the fundamentally different nature of hardware design means that these tools are harder to approach than their software counterparts. In particular, existing documentation rarely touches upon the issues that appear when we push the chip to its performance limit. We bring practical experience in working with HPC code on FPGA, combined with many years of HPC experience, to accelerate attendees into the world of scalable FPGA hardware design.

This tutorial will cover modeling, designing and implementing FPGA hardware using a modern HLS tool. Comments on performance modeling carry a theoretical aspect, but the majority of the tutorial will focus on practice, with the goal of enabling attendees to start writing parallel hardware of their own. After an introduction to dataflow architectures and an overview of existing technology in the domain of reconfigurable architectures, the programming part starts from the point of view of hardware description languages, then zooms out to the level of abstraction offered by HLS languages.

We use code examples to introduce central properties of the mapping from imperative languages to hardware, and the performance aspects implied by this transformation. Examples will be demonstrated in short to medium-length live coding sessions interleaved with the presentation, so that attendees see real coding at a pace that is easy to follow. All code shown will be made available to tutorial attendees.
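As a minimal sketch of what such a mapping can look like (this is an illustration, not code from the tutorial repository), consider a simple element-wise loop. The function name `Saxpy` and its parameters are hypothetical. The `#pragma HLS PIPELINE II=1` directive follows Vivado HLS syntax and requests that a new loop iteration starts every clock cycle; whether the tool achieves this depends on the target and the memory interfaces. A regular C++ compiler ignores the pragma, so the same source can be checked in software first.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: out[i] = a * x[i] + y[i].
// Each iteration is independent of the others, so an HLS tool can overlap
// iterations in a pipeline instead of executing them one after another.
void Saxpy(float a, const float *x, const float *y, float *out,
           std::size_t n) {
  assert(x != nullptr && y != nullptr && out != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    // Vivado HLS-style directive; ignored by ordinary C++ compilers.
    #pragma HLS PIPELINE II=1
    out[i] = a * x[i] + y[i];
  }
}
```

In software this is an ordinary loop. The hardware interpretation is that the multiplier and adder become pipeline stages that process different values of `i` at the same time.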

All concepts demonstrated will be put in the context of HPC by highlighting their performance impact, constituting an approach to HLS that is built around the main strength of the FPGA architecture: pipeline parallelism. We show how to resolve common complications that arise when designing pipelines, such as loop-carried dependencies and memory interface contention, and how pipelined designs can be scaled up to exploit the full performance potential of the FPGA. Finally, we reiterate the tutorial content by applying the concepts presented to an HPC application, demonstrating how the design can be scaled to near peak performance on a large FPGA accelerator.
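To illustrate one such complication, the sketch below shows a loop-carried dependency in a floating-point accumulation and one common way to relax it by interleaving partial sums. It is an assumption-laden illustration rather than the tutorial's own code: the function names and the constant `kLatency` (a stand-in for the adder latency in cycles) are hypothetical, the pragma follows Vivado HLS syntax, and a real tool may need additional hints before it schedules the second loop with an initiation interval of 1.

```cpp
#include <cassert>
#include <cstddef>

// Assumed latency of a floating-point adder in cycles (hypothetical value).
constexpr int kLatency = 8;

// Naive accumulation: every iteration needs the result of the previous
// addition, so the pipeline cannot accept a new input each cycle when the
// adder takes several cycles to finish.
float AccumulateNaive(const float *in, std::size_t n) {
  assert(in != nullptr);
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
    acc += in[i];
  }
  return acc;
}

// Interleaved accumulation: consecutive iterations update different partial
// sums, so a given partial sum is only revisited every kLatency iterations.
float AccumulateInterleaved(const float *in, std::size_t n) {
  assert(in != nullptr);
  float partial[kLatency];
  for (int k = 0; k < kLatency; ++k) {
    partial[k] = 0.0f;
  }
  for (std::size_t i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
    partial[i % kLatency] += in[i];
  }
  // Short final reduction of the partial sums.
  float acc = 0.0f;
  for (int k = 0; k < kLatency; ++k) {
    acc += partial[k];
  }
  return acc;
}
```

Note that the interleaved version adds the inputs in a different order, so for general floating-point data the result can differ from the naive sum in the last bits.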

In the last part, we explore the tool landscape of FPGA design, discussing the limitations of HLS, when lower-level programming is relevant, and the differences between the OpenCL and C++ variants of HLS.



Nallatech 520N (Intel Stratix 10).


Xilinx VCU1525 (Ultrascale+ VU9P).



Hands-on


The presentation will be interleaved with live coding sessions to introduce each new concept in a practical manner. Attendees will have access to the examples shown, allowing them to follow the tutorial on their own computers, either during or after the tutorial takes place.

The examples presented throughout the tutorial can be cloned from GitHub:
https://github.com/spcl/hls_tutorial_examples

The tools used for the examples are available on the Xilinx website. A trial version (called "WebPACK") is available without a commercial license.

Prerequisite knowledge


This tutorial is aimed at all audiences coming from an HPC background, interested in programming FPGAs for massive spatial parallelism. A basic understanding of hardware architectures is expected, but no practical experience with FPGAs or hardware design is required for the majority of the tutorial. Attendees with experience in HLS will benefit most from the material covered in the second half of the session.



References

PPoPP'18
[1] J. de Fine Licht, M. Blott, T. Hoefler:
 Designing scalable FPGA architectures using high-level synthesis. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, presented in Vienna, Austria, pages 403-404, ACM, ISBN: 978-1-4503-4982-6, Feb. 2018.