Productive parallel programming for FPGA with HLS

Title: Productive parallel programming for FPGA with high-level synthesis
Location: ETH Zurich, ML H 37.1
Speakers: Johannes de Fine Licht and Torsten Hoefler
Time: 14:00 - 17:30, Tuesday May 15, 2018


Registration: through this form (maximum 40 registrations can be accommodated)


Slides: available as pdf

Example code: available on github (or as a tarball snapshot)

HLS extensions: available on github


Abstract: As the scale of large high performance computing systems increases, so does their power consumption, making energy efficiency a first class citizen in their design. While GPUs and custom processors have improved this situation significantly, reconfigurable architectures, such as FPGAs, promise another major step in energy efficiency, constituting a middle ground between fixed hardware architectures and custom-built ASICs. Programming FPGAs has traditionally been done in hardware description languages, requiring extensive hardware knowledge and significant engineering effort. This tutorial shows how high-level synthesis (HLS) can be harnessed to productively achieve scalable pipeline parallelism on FPGAs. Attendees will learn how to target FPGA resources from high-level C++ or OpenCL code, guiding the mapping from imperative code to hardware, enabling them to develop massively parallel designs with real performance benefits. We treat concrete examples well known from the software world, relating traditional code optimizations to both corresponding and new transformations for hardware, building on existing knowledge when introducing new topics. By bridging the gap between software and hardware optimization, our tutorial aims to enable developers from a larger set of backgrounds to start tapping into the potential of FPGAs with real high performance codes.




We must pipeline both local loop schedules and the global dataflow for maximum throughput.



Content


This tutorial will cover modeling, designing and implementing FPGA hardware using a modern HLS tool. We include aspects of performance modeling, but the majority of the tutorial will focus on practice, with the goal of enabling attendees to start writing parallel hardware of their own. After an introduction to dataflow architectures and an overview of existing technology in the domain of reconfigurable architectures, the programming part starts from the point of view of hardware description languages, then zooms out to the level of abstraction offered by HLS languages.
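As a rough illustration of the kind of performance modeling mentioned above, the sketch below estimates the cycle count of a pipelined loop from its latency and initiation interval (the number of cycles between the start of consecutive iterations). The function name `PipelineCycles` and the numbers used are assumptions made for this example and are not taken from the tutorial material; the formula is the common first-order model, which ignores memory stalls.

```cpp
#include <cstdint>

// First-order model of a pipelined loop: the first iteration takes `latency`
// cycles to complete, and every further iteration completes
// `initiation_interval` cycles after the one before it.
// All names here are hypothetical and chosen for illustration.
constexpr std::uint64_t PipelineCycles(std::uint64_t iterations,
                                       std::uint64_t latency,
                                       std::uint64_t initiation_interval) {
  return iterations == 0
             ? 0
             : latency + initiation_interval * (iterations - 1);
}

// With an initiation interval of 1, the latency is paid only once.
static_assert(PipelineCycles(1024, 8, 1) == 1031, "8 + 1 * 1023");
// With an initiation interval of 8, the same loop is roughly 8x slower.
static_assert(PipelineCycles(1024, 8, 8) == 8192, "8 + 8 * 1023");
```

In this model the initiation interval dominates for long loops, which is why so much of the optimization effort goes into bringing it down to one.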

We use code examples to introduce central properties of the mapping from imperative languages to hardware, and the performance aspects implied by this transformation. Examples will be demonstrated with small to medium length live coding sessions interleaved with the presentation, such that attendees are faced with real coding at a pace that is easy to follow. All examples shown will be made available to attendees, as well as a virtual machine containing the relevant HLS tool, allowing them to follow and do further experiments before, during and/or after the tutorial.

All concepts demonstrated will be put in the context of HPC by highlighting their performance impact, constituting an approach to HLS that is built around the main strength of the FPGA architecture: pipeline parallelism. We show how to resolve common complications that arise when designing pipelines, such as loop-carried dependencies and memory interface contention, and how pipelined designs can be scaled up to exploit the full performance potential of the FPGA. Towards the end of the tutorial we revisit the content by applying the concepts presented to an HPC application, demonstrating how the design can be scaled to near peak performance on any modern FPGA accelerator.
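To make the notion of a loop-carried dependency concrete, here is a minimal sketch in plain C++ of one common way to resolve it: interleaving several independent partial sums. This is an illustration under stated assumptions rather than the code shown in the tutorial. The function names and the value of `kAddLatency` are hypothetical, and the pipelining directive is given only as a comment so that the code compiles with an ordinary C++ compiler.

```cpp
#include <cstddef>

// Assumed latency, in cycles, of one floating-point addition on the target
// device. Hypothetical value chosen for illustration only.
constexpr std::size_t kAddLatency = 8;

// Naive accumulation. Every iteration reads the value of `acc` written by
// the previous iteration, so a pipelined version could only start a new
// iteration once the previous addition has finished.
float SumNaive(const float *in, std::size_t n) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    // An HLS tool would be asked to pipeline this loop here, for example
    // with `#pragma HLS PIPELINE II=1` in Vivado HLS.
    acc += in[i];
  }
  return acc;
}

// Interleaved accumulation. Consecutive iterations update different partial
// sums, so each partial sum is touched only once every kAddLatency
// iterations, which leaves the adder enough time to finish.
float SumInterleaved(const float *in, std::size_t n) {
  float partial[kAddLatency] = {};  // All partial sums start at zero.
  for (std::size_t i = 0; i < n; ++i) {
    // Same pipelining request as in SumNaive would go here.
    partial[i % kAddLatency] += in[i];
  }
  // Short final reduction of the partial sums.
  float acc = 0.0f;
  for (std::size_t k = 0; k < kAddLatency; ++k) {
    acc += partial[k];
  }
  return acc;
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs; they agree exactly when all intermediate sums are representable, for example for small integer values. A real HLS tool may also need further hints before it accepts that the accesses to `partial` do not conflict.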

Finally, we discuss the limitations of HLS that are not usually advertised by vendors, the cases where lower-level programming is relevant, and the differences between the OpenCL and C++ variations of HLS. We close with some tips on how to approach new problems.

A summary of important optimization techniques was published at PPoPP 2018 [1].



Nallatech 520N (Intel Stratix 10).


Xilinx VCU1525 (Ultrascale+ VU9P).



Hands-on


The presentation is interleaved with live demos, exposing attendees to real code synthesized live with an HLS tool. The source code and makefiles, and a virtual machine with the HLS tool, will be made available to attendees. However, due to limitations imposed by long kernel build times, licensing and access to FPGA boards, we will not ask attendees to execute kernels in hardware during the tutorial.

The examples presented throughout the tutorial can be cloned from github:
https://github.com/spcl/hls_tutorial_examples

The tools used for the examples are available on the Xilinx website. A trial version (called "WebPACK") is available without a commercial license.

Prerequisite knowledge


This tutorial is aimed at audiences coming from an HPC background, interested in programming FPGAs for massive spatial parallelism. To benefit from the tutorial, the following prior knowledge is suggested:
  • Proficiency in C++
  • Basics of FPGA architecture
  • Performance optimization techniques on CPU/GPU (tiling, unrolling, vectorization)



References

[1] J. de Fine Licht, M. Blott, T. Hoefler: Designing scalable FPGA architectures using high-level synthesis. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'18), presented in Vienna, Austria, pages 403--404, ACM, ISBN: 978-1-4503-4982-6, Feb. 2018.