Open-Ended Portion (0%) – Optional

Select one of the following questions per team.

In writing your optimized implementation and describing your methodology in the report, the goal is to demonstrate your understanding of vector architectures, the sources of their performance advantages, and how these qualities can be employed in important computational kernels.

Sparse Matrix-Vector Multiplication (`spmv`)

For this problem, you will implement and optimize RISC-V vector code for sparse matrix-vector multiplication (SpMV), which is extensively used for graph processing and machine learning. SpMV computes $y = Ax$, where $A$ is a sparse matrix and $x$ is a dense vector. Unlike other dense linear algebra algorithms, most entries of $A$ are zero, and the matrix is represented in a condensed format in memory. The unoptimized pseudocode is shown below.

for (i = 0; i < n; i++) {
    for (k = ptr[i]; k < ptr[i+1]; k++) {
        y[i] += A[k] * x[idx[k]];
    }
}

A scalar reference implementation is provided in benchmarks/spmv. Build and run the benchmark as follows:

make spmv.riscv.out

Add your own double-precision vector implementation to benchmarks/vec-spmv/vec_spmv.S. When you are ready to test your code, build and run it on the ISA simulator:

make vec-spmv.riscv.out

Optimization

Once your code is correct, do your best to optimize spmv to minimize the number of cycles (per mcycle).

You are only allowed to write code in ; do not change any code in except for debugging. If you would like to perform some transformation on the inputs, only do so after you have verified the non-transformed version.

Common techniques that generally work well are loop unrolling, loop interchange, lifting loads out of inner loops and scheduling them earlier, blocking the code to utilize the full register file, and transposing matrices to achieve unit-stride accesses for improved locality.

More specific to vector architectures, try to refactor all element loads into vector loads. Use fused multiply-add instructions where possible. Also, carefully choose which loop(s) to vectorize for this problem, as not all loops can be safely vectorized!

Submission

Use the following command to archive your code for submission, and record the resulting file to the Gradescope autograder.

make zip-spmv

In a separate report, roughly explain how your SpMV implementation works, and report the dynamic instruction count and cache statistics. Also explain how you arrived at your implementation, and describe at least three optimizations that you applied in detail. How much improvement did they yield over the previous iterations of your code?

Radix Sort (`rsort`)

For this problem, you will implement and optimize RISC-V vector code for radix sort (rsort), a non-comparative integer sorting algorithm. The unoptimized pseudocode is shown below. In iteration $i$, each value is assigned to a bucket by its $i$-th digit, starting from the least-significant digit. After all values are allocated to buckets, they are merged sequentially into a new list. This algorithm repeats until all digits in every value have been traversed.

// TYPE_MAX: maximum value of the data type
for (power = 1; power < TYPE_MAX; power *= BASE) {
    // Number of buckets = BASE
    for (k = 0; k < BASE; k++) {
        buckets[k] = [];
    }
    for (j = 0; j < ARRAY_SIZE(array); j++) {
        key = array[j];
        // Extract next digit
        digit = (key / power) % BASE;
        // Assign value to bucket
        buckets[digit].append(key);
    }
    // Merge buckets sequentially
    new_array = [];
    for (k = 0; k < BASE; k++) {
        new_array.extend(bucket[k]);
    }
    array = new_array;
}

A scalar reference implementation is provided in benchmarks/rsort. Rather than directly assigning the values to buckets, the optimized version uses the buckets to produce a histogram of occurrences of each digit, which is then used to compute the offset of each element in the new array. Thus, the merge step becomes a permutation. Build and run the benchmark as follows:

make rsort.riscv.out

Add your own vector implementation to benchmarks/vec-rsort/vec_rsort.S. The data values are 32-bit unsigned ints. When you are ready to test your code, build and run it on the ISA simulator:

make vec-rsort.riscv.out

Optimization

You can feel that this may be a more challenging algorithm to vectorize than usual, so focus on first getting the code to work correctly before embarking on any complicated optimizations.

Once your code passes the given test, do your best to optimize rsort to minimize the number of cycles (per mcycle). You are only allowed to write code in ; do not change any code in except for debugging.

Indexed vector memory operations (scatter/gatter) and predication will be heavily used in this benchmark. Take care when updating buckets, as the same bucket may be accessed multiple times by a vector operation.

The general suggestions for spmv in Problem 4.1 also apply to rsort. In practice, vrem can be expensive in terms of performance, being essentially a long-latency divide operation per element, so it should be avoided. You may also notice that every bucket might be able to fit in a single vector; if so, consider how buckets could be kept in the same vectors across iterations.

Submission

Use the following command to archive your code for submission, and upload the resulting file to the Gradescope autograder.

make zip-rsort

In a separate report, roughly explain how your rsort implementation works, and record the dynamic instruction count and cache statistics. Also explain how you arrived at your implementation, and describe at least three optimizations that you applied in detail. How much improvement did they yield over the previous iterations of your code?

Open-Ended Portion (0%) – Optional

Sparse Matrix-Vector Multiplication (spmv)

Optimization

Submission

Radix Sort (rsort)

Optimization

Submission

Sparse Matrix-Vector Multiplication (`spmv`)

Radix Sort (`rsort`)