Data-Centric Parallel Programming

Decoupling domain science from performance optimization.

The DAPP project builds new representations for programs and algorithms in order to map them efficiently, and with high utilization, to the entire hardware architecture landscape (CPUs, GPUs, and FPGAs). With data-centric parallel programming, knowledge of performance optimization transfers directly, regardless of the scientific application or the target processor.

Publications

  • Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
    Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler. Supercomputing 2019.
  • Optimizing the Data Movement in Quantum Transport Simulations via Data-Centric Parallel Programming
    Alexandros Nikolaos Ziogas, Tal Ben-Nun, Guillermo Indalecio Fernández, Timo Schneider, Mathieu Luisier, Torsten Hoefler. Supercomputing 2019.
  • A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations
    Alexandros Nikolaos Ziogas, Tal Ben-Nun, Guillermo Indalecio Fernández, Timo Schneider, Mathieu Luisier, Torsten Hoefler. Supercomputing 2019. Won ACM Gordon Bell Prize.

DaCe

The DAPP project includes the DaCe (Data-Centric) programming environment. DaCe can compile programs written in several languages and frameworks, including Python (with a NumPy-like interface), TensorFlow, and a subset of MATLAB. At the core of DaCe is its data-centric intermediate representation, the Stateful Dataflow Multigraph (SDFG): a state machine of dataflow graphs. Below is a matrix multiplication in its declarative form in the DaCe NumPy interface, together with the resulting SDFG:

import dace

# Symbolic matrix dimensions; concrete values are resolved at call time
M = dace.symbol('M')
K = dace.symbol('K')
N = dace.symbol('N')

@dace.program
def gemm(A: dace.float64[M, K], B: dace.float64[K, N], 
         C: dace.float64[M, N]):
    # Transient variable
    tmp = dace.define_local([M, N, K], dtype=A.dtype)

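    # Parallel map over the full iteration space [0:M, 0:N, 0:K]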
    @dace.map
    def mult(i: _[0:M], j: _[0:N], k: _[0:K]):
        in_A << A[i, k]
        in_B << B[k, j]
        out >> tmp[i, j, k]

        out = in_A * in_B

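    # Sum tmp over the k dimension (axis 2) into C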
    dace.reduce(lambda a, b: a + b, tmp, C, axis=2)
Matrix Multiplication SDFG (Declarative)

In the graph, the blue region represents one state of the state machine. Circular nodes (A, B, tmp, C) are Arrays, hexagons (mult) are code Tasklets, triangles are reductions, and the edges represent the data moved between nodes. To explicitly mark and enable parametric parallelism, SDFGs define Map scopes (delimited by the trapezoid nodes) with symbolic ranges.
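
The SDFG can also be inspected programmatically. The following is a minimal sketch, assuming the public DaCe Python API (to_sdfg on decorated programs, SDFG.save, and the node classes in dace.sdfg.nodes); exact module paths may vary between DaCe versions:

from dace.sdfg import nodes

# Build the SDFG from the decorated Python program above
sdfg = gemm.to_sdfg()

for state in sdfg.nodes():          # states of the state machine
    for node in state.nodes():      # dataflow nodes within a state
        if isinstance(node, nodes.AccessNode):
            print('Array:', node.data)
        elif isinstance(node, nodes.Tasklet):
            print('Tasklet:', node.label)
        elif isinstance(node, nodes.MapEntry):
            print('Map over:', node.map.params)

# Save the graph so it can be opened in the SDFG viewer
sdfg.save('gemm.sdfg')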

This declarative version allocates M*N*K additional elements for the transient array tmp, which does not scale. If we run the script, however, DaCe can transform the SDFG by fusing the map and the reduction, as follows:

$ python3 gemm.py
==== Program start ====
Matrix multiplication 1024x1024x1024
Automatically applied 2 strict state fusions and removed 0 redundant arrays.
0. Pattern FPGATransformCubed in gemm
...
8. Pattern MapReduceFusion in 3 -> 4 -> 5
...
Select the pattern to apply (0 - 11 or name$id): 8
You selected (8) pattern MapReduceFusion in 3 -> 4 -> 5 with parameters {}
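
The same fusion can also be applied from a script instead of the interactive prompt. A minimal sketch, assuming MapReduceFusion is available under dace.transformation.dataflow and that apply_transformations returns the number of times the pattern was applied (as in recent DaCe releases):

from dace.transformation.dataflow import MapReduceFusion

sdfg = gemm.to_sdfg()
# Apply the pattern wherever it matches in the dataflow graph
applied = sdfg.apply_transformations(MapReduceFusion)
print(f'Applied MapReduceFusion {applied} time(s)')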

Either way, we obtain the following SDFG, which can also be written manually in Python:

@dace.program
def gemm(A: dace.float64[M, K], B: dace.float64[K, N], 
         C: dace.float64[M, N]):
    @dace.map
    def mult(i: _[0:M], j: _[0:N], k: _[0:K]):
        in_A << A[i, k]
        in_B << B[k, j]
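        # Write-conflict resolution: writes to C[i, j] from different k
        # iterations are combined with the given lambda (addition)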
        out >> C(1, lambda a,b: a+b)[i, j]

        out = in_A * in_B
Matrix Multiplication SDFG (Conflict-resolution)

Transforming the graph, whether manually or automatically, can yield performance that is 98.6% of Intel MKL on CPU, 90% of NVIDIA CUTLASS on GPU, and 4,992x faster than high-level synthesis with Xilinx SDAccel on FPGA (see Performance below for more results). An implicit matrix multiplication also produces a reasonably fast version automatically:

@dace.program
def gemm(A: dace.float64[M, K], B: dace.float64[K, N],
         C: dace.float64[M, N]):
    # Writes the matrix product into the output array C
    C[:] = A @ B
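
A dace.program is invoked like a regular Python function; on the first call, DaCe generates, compiles, and runs the corresponding code. A minimal usage sketch, assuming the symbolic sizes M, K, and N are inferred from the NumPy array shapes at call time:

import numpy as np

A = np.random.rand(1024, 1024)
B = np.random.rand(1024, 1024)
C = np.zeros((1024, 1024))

gemm(A, B, C)  # compiles the SDFG and computes C = A @ B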

Performance

Below are the runtimes of code generated by DaCe from SDFGs for several fundamental computational kernels: Query with a predicate, Matrix Multiplication (MM), Histogram, 2D Jacobi stencil, and Sparse Matrix-Vector multiplication (SpMV), measured on CPUs, GPUs, and FPGAs.

  • CPU performance: Intel Xeon E5-2650 v4 (12 cores)
  • GPU performance: NVIDIA Tesla P100
  • FPGA performance: Xilinx XCVU9P

Team

Johannes de Fine Licht
Alexandros-Nikolaos Ziogas
Timo Schneider