Open-Ended Portion: Optimizing Multi-Threaded Matrix Multiply (75%)

For this problem, you will implement and optimize a multi-threaded version of matrix-matrix multiply.

A naive implementation can be found in ${LAB5ROOT}/lab/mt-matmul-naive/mt-matmul_naive.c. First build and run the mt-matmul-naive benchmark to measure the baseline performance.

 cd ${LAB5ROOT}/sims/verilator
 make CONFIG=Lab5MIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-matmul-naive.riscv LOADMEM=1
 make CONFIG=Lab5MSIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-matmul-naive.riscv LOADMEM=1

Write your code in the matmul_opt() function found in ${LAB5ROOT}/lab/mt-matmul-opt/mt-matmul_opt.c, and rebuild the benchmarks.

 cd ${LAB5ROOT}/lab
 (source ~cs152/sp26/cs152.lab5.bashrc; make)

Then, run the mt-matmul-opt benchmark on the simulator as follows.

 make CONFIG=Lab5MIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-matmul-opt.riscv LOADMEM=1
 make CONFIG=Lab5MSIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-matmul-opt.riscv LOADMEM=1

Once your code passes the correctness test, do your best to optimize its performance. Your results from the MI and MSI runs will both be compared against a threshold to determine the final grade for this section. Go crazy!

Submission

Use the following command to prepare your code for submission, and upload the resulting file to the Gradescope autograder.

 cd ${LAB5ROOT}/lab
 make zip-matmul

Note that code outside mt-matmul_opt.c will be ignored.

Only submit once per group. Make sure to add your group members to your autograder submission.

Matrix Multiply Tips

A number of strategies can be used to optimize your code for this problem. First, note the problem size: 32×32 matrices of int elements, for a total memory footprint of 12 KiB (three 4 KiB matrices), while the L1 data cache is only 4 KiB, 4-way set-associative. Common techniques that generally work well are loop unrolling, hoisting loads out of inner loops and scheduling them earlier, blocking the code to utilize the full register file, transposing matrices to achieve unit-stride accesses that make full use of the L1 cache lines, and loop interchange.
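As a starting point, the blocking and transposition ideas above can be sketched in plain single-threaded C. This is an illustrative sketch only, not the lab's matmul_opt() interface: the names matmul_ref, matmul_blocked, and the block size BLK are hypothetical, and a real submission would also need the multi-core partitioning discussed below.

```c
#include <assert.h>

#define N   32   /* matrix dimension from the problem statement */
#define BLK 8    /* hypothetical block size; tune for the register file */

/* Naive reference: C += A * B, row-major. */
static void matmul_ref(const int A[], const int B[], int C[])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i*N + j] += A[i*N + k] * B[k*N + j];
}

/* Blocked variant with a transposed copy of B, so the inner k loop
 * walks both operands with unit stride (good use of L1 cache lines). */
static void matmul_blocked(const int A[], const int B[], int C[])
{
    int BT[N*N];
    for (int i = 0; i < N; i++)          /* transpose B once up front */
        for (int j = 0; j < N; j++)
            BT[j*N + i] = B[i*N + j];

    for (int ii = 0; ii < N; ii += BLK)  /* iterate over BLK x BLK tiles of C */
        for (int jj = 0; jj < N; jj += BLK)
            for (int i = ii; i < ii + BLK; i++)
                for (int j = jj; j < jj + BLK; j++) {
                    int sum = 0;         /* accumulate in a register */
                    for (int k = 0; k < N; k++)
                        sum += A[i*N + k] * BT[j*N + k];
                    C[i*N + j] += sum;
                }
}
```

The i and j loops inside each tile can additionally be unrolled so several partial sums live in registers at once; the best unroll and block factors are something to find experimentally on the simulated machine.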

You will also want to minimize sharing between cores; in particular, you will want to have each core responsible for writing its own pieces of the arrays (to avoid false sharing that causes lines to ping-pong between caches). Under the MI coherence protocol, it is also useful to avoid having both cores access the same portions of the input arrays at any given time, as there is no “S” state to accommodate shared lines.
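The partitioning idea can be sketched as follows. This sketch uses POSIX threads purely to be self-contained; the actual benchmark instead invokes matmul_opt() on each core with its own core ID, so only the row-partitioning logic inside worker() carries over. NCORES and the array names here are illustrative, not the lab's interface.

```c
#include <assert.h>
#include <pthread.h>

#define N      32
#define NCORES 2

static int A[N*N], B[N*N], C[N*N];

/* Each core owns a contiguous band of rows of C.  A row of 32 ints is
 * 128 bytes (two 64-byte cache lines), so row-granularity ownership
 * already keeps the two cores' writes on distinct lines: no false
 * sharing on C.  Note that both cores still read all of B; under MI,
 * staggering which portion of B each core touches at a given time can
 * further reduce ping-ponging of shared input lines. */
static void *worker(void *arg)
{
    long coreid = (long)arg;
    int lo = (int)coreid * (N / NCORES);   /* core 0: rows [0, 16)   */
    int hi = lo + N / NCORES;              /* core 1: rows [16, 32)  */
    for (int i = lo; i < hi; i++)
        for (int k = 0; k < N; k++) {
            int a = A[i*N + k];            /* hoist the A load */
            for (int j = 0; j < N; j++)    /* unit stride through B and C */
                C[i*N + j] += a * B[k*N + j];
        }
    return NULL;
}
```

The i-k-j loop order here gives unit-stride access to B and C without a transpose; combining it with the blocking sketch above is left as part of the optimization work.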

Debugging with Spike:

If the benchmark reports incorrect outputs, you can first debug your code in Spike, the ISA-level simulator.

 cd ${LAB5ROOT}/lab
 make
 spike -p2 mt-matmul-opt.riscv

The -p option sets the number of simulated hardware threads.

Note that spike is not a cycle-accurate processor model, and the “performance” numbers it reports can be distorted: the hardware threads do not execute concurrently (unlike on our actual system) but are switched after some number of instructions. This coarse-grained interleaving also means that spike will not expose every race condition that is possible on real hardware.