Introduction and Goals
In this lab, you will write RISC-V vector assembly code to gain a better understanding of how data-parallel code maps to vector-style processors, and to practice optimizing vector code for a given implementation.
While students are encouraged to discuss solutions to the lab assignments with each other, you must complete the directed portion of the lab yourself and submit your own work for these problems.
Graded Items
For Lab 4, the only required submission is your code for the directed portion. You only need to complete 2 of the 4 problems for the directed portion. All code is to be submitted through Gradescope. Please follow Section 3.9 for instructions on how to submit your code to the Gradescope autograder. The open-ended portion is optional, and will not be graded.
- (Directed) Problem 3.5: cmplxmult code
- (Directed) Problem 3.6: dgemv code
- (Directed) Problem 3.7: dgemm code
- (Directed) Problem 3.8: imax code
- (Open-ended) Problem 4.1: spmv code
- (Open-ended) Problem 4.2: rsort code
Background
The RISC-V vector ISA programming model is best explained by contrast with other popular data-parallel programming models. As a running example, we use a conditionalized SAXPY kernel, CSAXPY. The two programs below show CSAXPY expressed in C as both a vectorizable loop and as an SPMD (single-program, multiple-data) kernel. CSAXPY takes as input an array of boolean conditions, a scalar a, and vectors x and y; it computes y += ax for the elements for which the condition is true.
void csaxpy(size_t n, bool cond[], float a, float x[], float y[])
{
    for (size_t i = 0; i < n; i++)
        if (cond[i])
            y[i] = (a * x[i]) + y[i];
}
void csaxpy_spmd(size_t n, bool cond[], float a, float x[], float y[])
{
    if (tid.x < n)
        if (cond[tid.x])
            y[tid.x] = (a * x[tid.x]) + y[tid.x];
}
Packed-SIMD Programming Model
The code block below shows CSAXPY mapped to a hypothetical packed-SIMD architecture, similar to Intel’s SSE and AVX extensions. This SIMD architecture has 128-bit registers, each partitioned into four 32-bit fields. As with other packed-SIMD machines, ours cannot mix scalar and vector operands, so the code begins by broadcasting (or “splatting”) copies of a to each field of a SIMD register.
To map a long vector computation to this architecture, the compiler generates a stripmine loop, each iteration of which processes one four-element vector. In this example, the stripmine loop consists of a load from the conditions vector (line 6), which in turn is used to set a predicate register (line 7). The next four instructions (lines 8–11), which correspond to the body of the if statement in the first code block above, are masked by the predicate register.[1] Finally, the address registers are incremented by the SIMD width (lines 13–14), and the stripmine loop is repeated until the computation is finished (line 15) – almost.
Since the stripmine loop handles four elements at a time, extra code is needed to handle up to three fringe elements at the end. For brevity, we omitted this code; in this case, it suffices to duplicate the loop body, predicating all of the instructions on whether their index is less than n.
csaxpy_simd:
    slli a0, a0, 2
    add a0, a0, a3
    vsplat4 vv0, fa0
stripmine_loop:
    vlb4 vv1, (a1)
    vcmpez4 vp0, vv1
    !vp0 vlw4 vv1, (a2)
    !vp0 vlw4 vv2, (a3)
    !vp0 vfmadd4 vv1, vv0, vv1, vv2
    !vp0 vsw4 vv1, (a3)
    addi a1, a1, 4
    addi a2, a2, 16
    addi a3, a3, 16
    bltu a3, a0, stripmine_loop
    # handle fringe cases when (n % 4) != 0
    # ...
    ret
Lastly, the code block below shows CSAXPY mapped to a similar packed-SIMD architecture without predication support. The compare instruction writes the mask to a SIMD register instead (line 7). The bulk of the computation is done regardless of the condition (lines 8–10), and vblend4 selects the new value or the old value depending on the mask (line 11). The fringe case must be handled in scalar code due to the lack of predication.
csaxpy_simd:
    slli a0, a0, 2
    add a0, a0, a3
    vsplat4 vv0, fa0
stripmine_loop:
    vlb4 vv1, (a1)
    vcmpez4 vv3, vv1
    vlw4 vv1, (a2)
    vlw4 vv2, (a3)
    vfmadd4 vv1, vv0, vv1, vv2
    vblend4 vv1, vv3, vv1, vv2
    vsw4 vv1, (a3)
    addi a1, a1, 4
    addi a2, a2, 16
    addi a3, a3, 16
    bltu a3, a0, stripmine_loop
    # handle fringe cases when (n % 4) != 0
    # ...
    ret
The most important drawback to packed-SIMD architectures lurks in the assembly code: The SIMD width is expressly encoded in the instruction opcodes and address generation code. When the architects of such an ISA wish to increase performance by widening the vectors, they must add a new set of instructions to process these vectors. This consumes substantial opcode space: for example, Intel’s newest AVX instructions are as long as 11 bytes. Worse, application code cannot automatically leverage the widened vectors without being recompiled to use the new instructions. Conversely, code compiled for wider SIMD registers fails to execute on older machines with narrower ones.
SIMT Programming Model
The following code block shows the same code mapped to a hypothetical SIMT architecture, akin to an NVIDIA GPU. The SIMT architecture exposes the data-parallel execution resources as multiple threads of execution; each thread executes one element of the vector. The microarchitecture fetches an instruction once but then executes it on many threads simultaneously using parallel datapaths; therefore, a scalar instruction shown in the code executes like a vector instruction.
csaxpy_simt:
    mv t0, tid
    bgeu t0, a0, skip
    add t1, a1, t0
    lbu t1, (t1)
    beqz t1, skip
    slli t0, t0, 2
    add a2, a2, t0
    add a3, a3, t0
    flw ft0, (a2)
    flw ft1, (a3)
    fmadd.s ft0, fa0, ft0, ft1
    fsw ft0, (a3)
skip:
    stop
One inefficiency of this approach is immediately evident: Since the number of launched threads must be a multiple of the warp size (32 for NVIDIA GPUs), the first action taken by each thread is to determine whether it is within bounds (lines 2–3). Another inefficiency results from the duplication of scalar computation: Despite the unit-stride access pattern, each thread explicitly computes its own addresses. (The SIMD architecture, in contrast, amortizes this work over the SIMD width.) Memory coalescing logic is then needed to recover the original unit-stride access pattern from the individual memory requests issued by each thread, which otherwise must be treated as a scatter/gather. Moreover, massive replication of scalar operands reduces the effective utilization of register file resources: Each thread has its own copy of the three array base addresses and the scalar a. This represents a threefold increase over the fundamental architectural state.[2]
Traditional Vector Programming Model
Packed SIMD and SIMT architectures have a disjoint set of drawbacks: The main limitation of the former is the static encoding of the vector length, whereas the primary drawback of the latter is the lack of scalar processing. One can imagine an architecture that has the scalar support of the former and the dynamism of the latter. In fact, it has existed for over 40 years, in the form of the traditional vector machine, embodied by the Cray-1.
The key feature of this architecture is the vector length register (VLR), which represents the number of vector elements that will be processed by each vector instruction, up to the hardware vector length (HVL). Software manipulates the VLR by requesting a certain application vector length (AVL) with the vsetvl instruction; in response, the vector unit sets the active vector length to the shorter of the AVL and the HVL.
As with packed-SIMD architectures, a stripmine loop iterates until the application vector has been completely processed. But, as the csaxpy_tvec code block below shows, the difference lies in the adjustment of the VLR at the head of every loop iteration (line 3). Most importantly, the software is agnostic to the underlying hardware vector length: The same code executes correctly and with maximal efficiency on machines with any HVL. Moreover, no fringe code is required at all: On the final trip through the loop, the VLR is simply set to the exact remainder.
csaxpy_tvec:
stripmine_loop:
    vsetvl t0, a0
    vlb vv0, (a1)
    vcmpez vp0, vv0
    !vp0 vlw vv0, (a2)
    !vp0 vlw vv1, (a3)
    !vp0 vfmadd.s vv0, vv0, fa0, vv1
    !vp0 vsw vv0, (a3)
    add a1, a1, t0
    slli t1, t0, 2
    add a2, a2, t1
    add a3, a3, t1
    sub a0, a0, t0
    bnez a0, stripmine_loop
    ret
The advantages of traditional vector architectures over the SIMT approach are owed to the coupled scalar control processor. The scalar register file holds only one copy of the array pointers and the scalar a. The address computation instructions execute only once per stripmine loop iteration, rather than once per element, effectively amortizing their cost by a factor of the HVL.
RISC-V Vector Programming Model
The RISC-V “V” vector extension (RVV) resembles a traditional vector ISA, with some key distinctions. In particular, the organization of the vector register file is dynamically configurable through the vector type register (vtype), which consists of two fields: SEW and LMUL. These dictate how the vector state is conceptually partitioned along two orthogonal dimensions.
The selected element width (SEW) sets the default width of each vector element in bits. A narrower SEW enables elements to be more densely packed into the available storage, increasing the maximum vector length. SEW also determines the operation widths of polymorphic vector instructions, which allow reusing the same instruction to support a variety of data widths (e.g., both single-precision and double-precision floating-point), thereby conserving valuable opcode space.
The length multiplier (LMUL) is an integer power of 2 (from 1 to 8) that controls how many consecutive vector registers are grouped together to form longer vectors. With LMUL=2, vector registers vn and vn+1 are operated on as one vector with twice the maximum vector length. Instructions must use vector register specifiers evenly divisible by LMUL; attempts to use invalid specifiers raise an illegal instruction exception. LMUL serves to increase efficiency through longer vectors when fewer architectural registers are needed, as well as to accommodate mixed-width operations (as a mechanism for maintaining identical vector lengths among vectors of different datatype widths).
The csaxpy_rvv code block below shows CSAXPY mapped to RVV 0.10. The vsetvli instruction enables both the vector length and vtype to be configured in one step. SEW is initially set to 8 bits (line 3) to load the boolean values from cond. The second vsetvli instruction (line 6) widens SEW to 32 for single-precision operations; the special use of x0 for the requested vector length has the effect of retaining the current vector length.
csaxpy_rvv:
stripmine_loop:
    vsetvli t0, a0, e8, m2, ta, ma    # set VL; configure SEW=8, LMUL=2
    vle8.v v8, (a1)                   # load cond[i]
    vmsne.vi v0, v8, 0                # set mask if cond[i] != 0
    vsetvli x0, x0, e32, m8, ta, mu   # configure SEW=32, LMUL=8; retain VL
    vle32.v v8, (a2)                  # load x[i]
    vle32.v v16, (a3)                 # load y[i]
    vfmacc.vf v16, fa0, v8, v0.t      # y[i] = (a * x[i]) + y[i] if cond[i] != 0
    vse32.v v16, (a3)                 # store y[i]
    sub a0, a0, t0                    # decrement n
    add a1, a1, t0                    # bump pointer cond
    slli t0, t0, 2                    # scale VL to byte offset
    add a2, a2, t0                    # bump pointer x
    add a3, a3, t0                    # bump pointer y
    bnez a0, stripmine_loop
    ret
There are no separate vector predicate registers in RVV, which reduces the minimum architectural state. In the base V extension, predicated instructions always use v0 as the source of the vector mask, while other vector registers can be used to temporarily hold mask values computed with vector logical and comparison instructions. Annotating a maskable instruction with v0.t (line 9) causes the operation on each element to execute conditionally based on that element's mask bit in v0.
In the example, note that the vector mask is computed under SEW=8 but consumed under SEW=32. For the predicate bit of each element to remain in the same position under both vtype settings, the SEW/LMUL ratio must be kept constant. (The "Mask Register Layout" section of the specification, https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html#sec-mask-register-layout, explains the constraints involved.) Hence, it is necessary to set LMUL=2 when SEW=8 to match the later use of LMUL=8 with SEW=32.
Specification Versioning
The V extension has evolved substantially over the past few years. Both this lab and the lecture use the 0.10 draft of the specification, archived at https://inst.eecs.berkeley.edu/~cs152/sp22/handouts/sp22/riscv-v-spec-0.10.html.
The latest working draft is available at https://github.com/riscv/riscv-v-spec. Notably, the vector extension was officially ratified in November 2021, and the first commercial processors implementing it have since become available. Older iterations of Lab 4 were built on the 0.4 specification, from which the current proposal diverges radically enough that they should be regarded as two different ISAs.
Here are some of the important sections to look at in the RVV specification:
- Vector Loads and Stores (Section 7)
- Vector Floating-Point Instructions (Section 14)
- Vector Reduction Operations (Section 15)
- Vector Mask Instructions (Section 16)
- Vector Permutation Instructions (Section 17)
Acknowledgments
This lab was originally designed by Donggyu Kim in spring 2018. It is heavily inspired by the previous sets of CS 152 labs developed by Christopher Celio, which targeted the Hwacha vector processor.
[1] We treat packed-SIMD architectures generously by assuming the support of full predication. This feature was quite uncommon: Intel AVX, for example, only introduced predication with the AVX-512 extension in 2016, first in a subset of Xeon models and only later in mainstream products starting with Cannon Lake.
[2] More recent GPU architectures, such as the RDNA ISA from AMD, have incorporated support for scalar values and vector memory operations.