Directed Portion (100%)

This lab focuses on writing programs that target the RISC-V vector ISA. This will involve:

  1. Writing vector assembly code for different benchmarks

  2. Testing their correctness and estimating performance using the RISC-V ISA simulator, spike

Although spike is normally a purely functional simulator, it has been extended for this lab with a rudimentary timing model of a single-issue in-order core with a standard vector unit. This timing model approximates instruction latencies arising from data and structural hazards, multi-cycle functional units, cache misses, and branch mispredictions.

For these simulations, spike is configured to model the following hardware parameters, intended to closely match the default RocketConfig design from Chipyard.

Vector unit:

  • 512-bit hardware vector length (VLEN), 128-bit datapath

  • Two vector functional units (VFU0, VFU1)

    • Both VFUs contain an integer ALU, an integer multiplier, and floating-point FMA units

    • VFU0 handles reductions

    • VFU1 handles floating-point conversions and comparisons

  • One vector memory unit (VMU)

    • 128-bit interface to L1 data cache

    • 128 bits/cycle bandwidth for unit-stride memory operations

    • 1 element/cycle bandwidth for constant-stride and indexed memory operations (including vector AMOs)

  • In-order issue with flexible vector chaining

Functional units:

  • Integer ALU, 1-cycle latency

  • Unpipelined scalar integer multiplier (8 bits/cycle) and divider (1 bit/cycle)

  • Pipelined vector integer multiplier, 3-cycle latency

  • Pipelined scalar and vector floating-point units

    • 4-cycle double-precision FMA latency

    • 3-cycle single-precision and half-precision FMA latency

    • 2-cycle floating-point conversion and comparison latency

  • Unpipelined vector reduction unit, 1 element/cycle

Memory hierarchy:

  • L1 instruction cache: 16 KiB, 4-way, 64 B lines

  • Blocking L1 data cache: 16 KiB, 4-way, 64 B lines, 2-cycle latency for scalar loads

  • Inclusive L2 cache: 512 KiB, 8-way, 64 B lines, 20-cycle latency

  • 75-cycle main memory latency [1]

Branch prediction:

  • 3-cycle branch misprediction penalty

  • 28-entry fully-associative BTB

  • 512-entry BHT, gshare with 8 bits of global history

  • 6-entry RAS

Setup

To complete this lab, ssh into an instructional server using the instructional computing account provided to you. The lab infrastructure has been set up to run on the eda-{1..3}.eecs.berkeley.edu machines (eda-1.eecs, eda-2.eecs, etc.).

Once logged in, source the following script to initialize your shell environment so that you can access the tools for this lab. [2] Run it at the start of each session.

cd ~ # go to your home directory
source ~cs152/sp26/cs152.lab4.bashrc

First, clone the lab materials into an appropriate workspace and initialize the submodules.

git clone https://github.com/cs152-teach/chipyard-cs152.git -b sp26-lab4 lab4
cd lab4

Register Convention

When writing assembly code, strictly adhere to the integer and floating-point register conventions set forth by the RISC-V psABI [3] (processor-specific application binary interface). Inadvertently clobbering registers will cause compatibility issues when your assembly is linked with compiled code.

The x registers s0–s11 are callee-saved: if used, they must be preserved across function calls by saving them on the stack and restoring them before returning. t0–t6 and a0–a7 can be used as temporaries. gp and tp are reserved for special purposes in the execution environment and should be avoided. Similarly, for the f registers, fs0–fs11 are callee-saved, while ft0–ft11 and fa0–fa7 can be used as temporaries.

Currently, all vector registers v0–v31 are treated as caller-saved.

Conditional SAXPY (csaxpy)

This first kernel is intended to provide you with an idea of how to design a kernel using the RISC-V Vector ISA.

The full vector code for csaxpy is already provided to you in benchmarks/vec-csaxpy/vec_csaxpy.S. It is essentially identical to the example described earlier in Section 2.4.

Take a moment to study how it works; although relatively simple, it is a useful demonstration of some important ISA features such as SEW, LMUL, and predication.

For comparison, the scalar version is also available in benchmarks/csaxpy/.
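Concretely, the kernel computes the following, shown here as a minimal scalar C sketch. The function signature is illustrative; the actual interface is defined by the benchmark sources.

```c
/* Conditional SAXPY: y[i] = a*x[i] + y[i] wherever cond[i] is nonzero.
   The signature below is a hypothetical stand-in for illustration;
   see benchmarks/csaxpy/ for the real interface. */
static void csaxpy(int n, const char *cond, float a,
                   const float *x, float *y) {
    for (int i = 0; i < n; i++)
        if (cond[i])               /* in vector code, this becomes a mask */
            y[i] = a * x[i] + y[i];
}
```

In the vector version, the `if` becomes a predicate mask, so that only the active elements of y are updated.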

Build and run both benchmarks on spike as follows:

cd benchmarks
make csaxpy.riscv.out
make vec-csaxpy.riscv.out

Now that you understand the infrastructure, how to run benchmarks, and how to collect results, you can begin writing your own benchmarks.

Directed Portion Vector Kernels

For the directed portion submission, pick 2 of the 4 kernels below to implement. Once done, follow the directions in Section 3.9 to submit your code to the Gradescope assignment.

  • (Directed) Problem 3.5: cmplxmult code

  • (Directed) Problem 3.6: dgemv code

  • (Directed) Problem 3.7: dgemm code

  • (Directed) Problem 3.8: imax code

Complex Vector Multiplication (cmplxmult)

cmplxmult multiplies two vectors of single-precision complex values element-wise. The pseudocode is shown below.

struct Complex {
    float real;
    float imag;
};

for (i = 0; i < m; i++) {
    c[i].real = (a[i].real * b[i].real) - (a[i].imag * b[i].imag);
    c[i].imag = (a[i].imag * b[i].real) + (a[i].real * b[i].imag);
}

Build and run the scalar version provided in benchmarks/cmplxmult/:

make cmplxmult.riscv.out

Your task is to vectorize the code. Complete the assembly function in benchmarks/vec-cmplxmult/vec_cmplxmult.S according to the TODO comments.

When you are ready to test your code, build and run it on the ISA simulator:

make vec-cmplxmult.riscv.out

If no errors are reported, you are done!

Segmented Vector Memory Operations

When working with arrays of structs, you may want to use segmented vector memory operations to conveniently unpack each field into separate (consecutively numbered) vector registers. These are described in the https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html#sec-aos section of the RVV spec. [4]
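As a rough C analogue, a two-field segmented load (e.g. vlseg2e32.v for an array of {real, imag} pairs) behaves like a single pass that de-interleaves each field into its own destination. The function and array names below are illustrative only.

```c
/* C model of a two-field segmented load (e.g. vlseg2e32.v): one pass
   over an array of {real, imag} pairs fills two separate destinations,
   standing in for two consecutively numbered vector registers. */
struct Complex { float real; float imag; };

static void seg2_load(int vl, const struct Complex *a,
                      float *re, float *im) {
    for (int i = 0; i < vl; i++) {
        re[i] = a[i].real;  /* field 0 -> first destination register  */
        im[i] = a[i].imag;  /* field 1 -> second destination register */
    }
}
```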

Fused Multiply-Add Operations

Although not necessary, more efficient code can be written by using fused multiply-add instructions, which perform two vector floating-point operations per instruction. These come in two destructive forms that overwrite one of the vector register operands, either the addend or the first multiplicand. [5] All relevant fused varieties are listed in the https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html#_vector_single_width_floating_point_fused_multiply_add_instructions section.
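In scalar C terms, the per-element semantics of the two destructive forms (per the RVV spec) look like this. The function names are illustrative, not real mnemonics.

```c
/* Per-element semantics of the two destructive FMA forms:
   vfmacc vd, vs1, vs2 overwrites the addend,
   vfmadd vd, vs1, vs2 overwrites the first multiplicand. */
static float fmacc(float vd, float vs1, float vs2) {
    return vs1 * vs2 + vd;   /* vd[i] = vs1[i]*vs2[i] + vd[i] */
}

static float fmadd(float vd, float vs1, float vs2) {
    return vd * vs1 + vs2;   /* vd[i] = vd[i]*vs1[i] + vs2[i] */
}
```

For accumulation patterns like dot products, vfmacc is usually the natural choice, since the running sum is the operand being overwritten.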

Double-Precision Generalized Matrix-Vector Multiplication (dgemv)

dgemv performs the double-precision matrix-vector multiplication $y \leftarrow A x + y$, where $x$ and $y$ are vectors and $A$ is an $m \times n$ matrix. It is a fundamental kernel from Level 2 of BLAS (Basic Linear Algebra Subprograms). Since its arithmetic intensity is relatively low (each element of $A$ is used only once), its performance is typically memory-bound.

The unoptimized pseudocode is shown below. The matrix A is stored in row-major order, i.e., the entries in a row are contiguous in memory.

for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
        y[i] += A[i][j] * x[j];
    }
}

Build and run the scalar version provided in benchmarks/dgemv/:

make dgemv.riscv.out

Your task is to vectorize the inner loop along the n dimension. [6] Complete the assembly function in benchmarks/vec-dgemv/vec_dgemv.S according to the TODO comments. When you are ready to test your code, build and run it on the ISA simulator:

make vec-dgemv.riscv.out

Reductions and Scalars

Note that the inner product to compute y[i] involves a sum reduction. Particularly for long vectors, reduction operations can be somewhat expensive due to the inter-element communication required. Thus, the recommended approach is to stripmine the loop, accumulating partial sums in parallel, and then reduce the vector once at the end of the loop to yield a single value.
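The recommended structure can be sketched in scalar C, with VL standing in for the hardware vector length (an illustrative constant, not part of the lab code):

```c
/* Sketch of the stripmined reduction strategy: partial sums
   accumulate independently per lane across the main loop, and a
   single reduction at the end produces the dot product. */
#define VL 4  /* illustrative lane count */

static double dot_partial(int n, const double *a, const double *x) {
    double partial[VL] = {0};
    int i = 0;
    for (; i + VL <= n; i += VL)              /* stripmined main loop */
        for (int l = 0; l < VL; l++)
            partial[l] += a[i + l] * x[i + l]; /* one vfmacc per pass */
    double sum = 0;
    for (int l = 0; l < VL; l++)              /* single reduction at end */
        sum += partial[l];
    for (; i < n; i++)                        /* remainder elements */
        sum += a[i] * x[i];
    return sum;
}
```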

You may find the https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html#_vector_single_width_floating_point_reduction_instructions section of the RVV spec to be useful. [7] Note that the vector reduction instructions interact with scalar values held in vector registers: the scalar result is written to element 0 of the destination vector register, and the operation also takes an initial scalar input from element 0 of a second vector operand. Refer to https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html#_floating_point_scalar_move_instructions for transferring a single value between an f register and a vector register.

Double-Precision Generalized Matrix-Matrix Multiplication (dgemm)

dgemm performs double-precision matrix-matrix multiplication $C \leftarrow A B + C$, where $A$ and $B$ are both $n \times n$ matrices. This is another fundamental kernel for scientific computing and machine learning, part of BLAS Level 3. The unoptimized pseudocode is shown below. All matrices are stored in row-major order.

for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

The optimized scalar version is provided in benchmarks/dgemm/. Note how loop unrolling (of i and k) and register blocking expose opportunities for data reuse to improve efficiency. Some extra code is needed to handle the remainder after the unrolled loop. For simplicity, this implementation is not cache-blocked, although doing so would be a straightforward transformation. Build and run the scalar code as follows:

make dgemm.riscv.out

Your task is to vectorize the second loop over j. Submatrices of C and B can be held in vector registers, while the entries of A are loaded as scalars. This naturally leads to using vector-scalar operations to compute the partial products. As a hint, first work through the matrix multiplication by hand to see how the computation can be rearranged into a vectorizable pattern.
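For reference, here is the rearranged loop nest in scalar C. This is a sketch of the access pattern only; your submission must implement it in vector assembly.

```c
/* One way to rearrange the dgemm loops so the j loop vectorizes:
   the innermost loop applies the scalar a_ik across a whole row of B,
   which maps directly onto a vector-scalar fused multiply-add.
   Matrices are row-major, flattened into 1-D arrays. */
static void dgemm_rearranged(int n, const double *A,
                             const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a_ik = A[i*n + k];       /* scalar load from A */
            for (int j = 0; j < n; j++)     /* vectorizable over j */
                C[i*n + j] += a_ik * B[k*n + j];
        }
}
```

Rows of C stay live across the k loop, which is where register blocking (holding several rows of C in vector registers at once) pays off.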

Complete the assembly functions for the main inner loop in benchmarks/vec-dgemm/vec_dgemm_inner.S and the remainder loop in benchmarks/vec-dgemm/vec_dgemm_remainder.S. Try to leverage fused multiply-add instructions where possible. When you are ready to test your code, build and run it on the ISA simulator:

make vec-dgemm.riscv.out

Fused Multiply-Add Operations

You may find the https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html#_vector_single_width_floating_point_fused_multiply_add_instructions section of the RVV spec to be useful.

Index of Maximum Element (imax)

In this problem, you will vectorize a less conventional vector application, imax, which finds the index of the largest value in an array. One use case is for identifying the pivot element in certain matrix algorithms, such as Gaussian elimination. The pseudocode is shown below.

idx = 0, max = -INFINITY;
for (i = 0; i < n; i++) {
    if (x[i] > max) {
        max = x[i];
        idx = i;
    }
}

Build and run the scalar version provided in benchmarks/imax/:

make imax.riscv.out

Despite the simplicity of the scalar implementation, vectorizing imax is not trivial. The following approach is suggested:

  1. Keep the current maximum in an f register, initialized to negative infinity, and the current index in an x register, initialized to zero.

  2. Load a vector and find its maximum with a reduction.

  3. Compare against the global maximum.

  4. Use a vector floating-point comparison to represent the location of the maximum element as a vector mask.

  5. Find the first set bit in the mask using vfirst.m, which yields the index of the lowest-numbered mask element that is set, or -1 if none are set.

  6. Update the global index and maximum if necessary.
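The steps above can be modeled in scalar C, with VL as an illustrative stand-in for the vector length. The inner scans correspond to the vector reduction, the mask-generating comparison, and vfirst.m.

```c
#include <math.h>

/* Scalar C model of the suggested vector strategy: process the array
   in chunks of VL, reduce each chunk to its maximum, and only when the
   chunk max beats the running max scan for its first position within
   the chunk (the mask + vfirst.m step). */
#define VL 4  /* illustrative vector length */

static long imax_blocked(long n, const double *x) {
    double max = -INFINITY;
    long idx = 0;
    for (long i = 0; i < n; i += VL) {
        long vl = (n - i < VL) ? n - i : VL;
        double m = -INFINITY;                 /* max reduction over chunk */
        for (long l = 0; l < vl; l++)
            if (x[i + l] > m) m = x[i + l];
        if (m > max) {                        /* scalar compare vs. global */
            max = m;
            for (long l = 0; l < vl; l++)     /* mask compare + vfirst.m */
                if (x[i + l] == m) { idx = i + l; break; }
        }
    }
    return idx;
}
```

Note the strict > in the global comparison: like the scalar pseudocode, this keeps the first occurrence of the maximum when there are ties across chunks.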

Once you understand how the reduction and mask operations work, complete the assembly function in benchmarks/vec-imax/vec_imax.S according to the TODO comments. When you are ready to test your code, build and run it on the ISA simulator:

make vec-imax.riscv.out

Reduction Operations

You may find the https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html#_vector_single_width_floating_point_reduction_instructions section of the RVV spec to be useful.

Submission

Run the following to collect all of your code for the directed portion into one archive, and upload it to the Gradescope autograder.

make zip-directed

The following source files should be present at the root of the ZIP file:

  • vec_cmplxmult.S
  • vec_dgemv.S
  • vec_dgemm_inner.S
  • vec_dgemm_remainder.S
  • vec_imax.S

The directed problems are evaluated based on correctness, so check that your code passes the autograder test suite. Note that you only need to complete 2 of the 4 kernels.

No written report is required for the directed portion.

  1. Equivalent to Rocket Chip with a 1 GHz mbus frequency and a 667 MHz DRAMSim2 frequency. 

  2. This lab requires a different software toolchain that has experimental support for RVV 0.10. 

  3. https://github.com/riscv/riscv-elf-psabi-doc/blob/master/riscv-elf.md 

  4. Since their implementation can be non-trivial, the segment instructions were previously specified as an optional extension (Zvlsseg). However, as of v1.0, they are required for compliant RVV implementations. 

  5. As one of the design constraints on the base V extension is to not consume too much 32-bit encoding space, non-destructive three-operand instructions are not provided. 

  6. Alternatively, the i and j loops could be interchanged, which would permit the vector load of x to be hoisted out of the inner loop and reused for all rows of A. However, each iteration would then necessitate a reduction operation. If the matrix were instead transposed to column-major order, vectorization along the m dimension would be simpler, as explicit reductions can be entirely avoided while fully reusing each element of x. 

  7. On vector architectures that do not feature explicit reduction instructions, this can be implemented by recursively halving the vector length and adding both halves together with vector addition. 