Directed Portion (25%)

General Methodology

This lab will focus on writing multi-threaded C code. This will be done in two steps:

Build the Verilog cycle-accurate emulator of the dual-core processor (if the cache coherence protocol needs to be changed)
Verify the correctness and measure the performance of your code on the cycle-accurate emulator

Setup

To complete this lab, ssh into an instructional server with the instructional computing account provided to you. The lab infrastructure has been set up to run on the eda{1..3}.eecs.berkeley.edu machines.

First, clone the lab materials into an appropriate workspace and initialize the submodules.

cd ~ # go to your home directory
source conda/etc/profile.d/conda.sh
git clone https://github.com/cs152-teach/chipyard-cs152.git -b sp26-lab5 lab5
cd lab5
LAB5ROOT=$PWD
./build-setup.sh riscv-tools --skip-toolchain --skip-firesim --skip-marshal --skip-circt
source env.sh

Anytime you open a new terminal, make sure to run the following commands.

cd ~ # go to your home directory
source conda/etc/profile.d/conda.sh
cd lab5
LAB5ROOT=$PWD
source env.sh

The remainder of this lab will use ${LAB5ROOT} to denote the path of the working tree. Its directory structure is outlined below:

${LAB5ROOT}/
├── lab/                    # Benchmark source code
│   ├── mt-vvadd-naive/               # Naive vvadd code
│   ├── mt-vvadd-opt/                 # Optimized vvadd code
│   ├── mt-matmul-naive/              # Naive matmul code
│   └── mt-matmul-opt/                # Optimized matmul code
├── generators/             # Library of RTL generators
│   ├── chipyard/                     # SoC configurations
│   ├── rocket-chip/                  # Rocket Chip generator
│   ├── testchipip/                   # RTL blocks for interfacing with test chips
│   └── ...
└── sims/
    ├── verilator/                  # Verilator simulation flow
    └── ...

Measuring Vector-Vector Add with MSI Coherence

First, to acclimate ourselves to the Lab 5 infrastructure, we will gather the results of a poorly written implementation of vvadd, which performs a simple vector-vector addition.

Navigate to the ${LAB5ROOT}/lab/mt-vvadd-naive directory, which has a few files of interest. First, holds a static copy of the input vectors and expected results vector. Second, contains code for managing the benchmark, which includes initializing the state of the program, calling the vvadd function itself, and verifying the output of the function. Lastly, a very poor implementation of multi-threaded vvadd can be found in .

Build the simulator and run the mt-vvadd-naive benchmark on a dual-core configuration with an MSI coherence policy.

You will need to run the following command to build the multithreaded benchmarks.

cd ${LAB5ROOT}/lab
(source ~cs152/sp26/cs152.lab5.bashrc; make) 

The parentheses run the enclosed commands in a subshell. This isolates the environment for the RISC-V compiler, which is incompatible with the spike and chipyard environment we are using.

cd ${LAB5ROOT}/sims/verilator
make CONFIG=Lab5MSIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-vvadd-naive.riscv LOADMEM=1

make will automatically rebuild the simulator if changes to its sources are detected. The CONFIG variable instructs the generator to use the configuration with two cores and MSI coherence.

Note that the first time you build Lab5MSIDualRocketConfig and Lab5MIDualRocketConfig, verilator will take a while to build each configuration. After that first build, the make commands should run much faster.

You should see something similar to the following output for mt-vvadd-naive, which comes from timing a section of code that calls the vvadd_naive() function:

vvadd_naive: 44551 cycles, 44.5 cycles/iter, 8.8 CPI
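When comparing runs across configurations, it can help to pull the numbers out of the result line programmatically. The following is a minimal host-side sketch, assuming the line always has the shape shown above. The struct bench_result type and the parse_result() and speedup() helpers are hypothetical names introduced for illustration, not part of the lab infrastructure.

```c
#include <stdio.h>

/* Parsed fields of one benchmark result line (hypothetical helper type). */
struct bench_result {
    char   name[64];          /* e.g. "vvadd_naive" */
    long   cycles;            /* total cycles in the timed region */
    double cycles_per_iter;
    double cpi;
};

/* Parse a line such as
 *   "vvadd_naive: 44551 cycles, 44.5 cycles/iter, 8.8 CPI"
 * Returns 1 if all four fields were read, 0 otherwise. */
int parse_result(const char *line, struct bench_result *r)
{
    int n = sscanf(line, " %63[^:]: %ld cycles, %lf cycles/iter, %lf CPI",
                   r->name, &r->cycles, &r->cycles_per_iter, &r->cpi);
    return n == 4;
}

/* Speedup of an optimized run over a baseline run, by total cycle count. */
double speedup(const struct bench_result *base, const struct bench_result *opt)
{
    return (double)base->cycles / (double)opt->cycles;
}
```

For example, parsing the MSI result line into one struct and the MI result line into another, then calling speedup() on the pair, gives the ratio of their cycle counts.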

Measuring Vector-Vector Add with MI Coherence

Run the mt-vvadd-naive benchmark again but using an MI coherence policy.

cd ${LAB5ROOT}/sims/verilator
make CONFIG=Lab5MIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-vvadd-naive.riscv LOADMEM=1

Some things to consider:

Taking into account that the code is executed on a multi-core cache-coherent system, consider the naive implementation in vvadd_naive(). Why is it suboptimal on MI and MSI? What are some potential changes you could make to improve the performance?
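One way to reason about this question is to count how many cache lines of the output vector are written by more than one core under a given work split. The sketch below is a host-side illustration only. It assumes a 64-byte cache line and a line-aligned array, and it compares two hypothetical splits, element-interleaved and contiguous. It makes no claim about which split the provided vvadd_naive() uses; check the source. All function names are introduced for illustration.

```c
#include <stddef.h>

#define LINE_BYTES 64  /* assumed cache line size, not taken from the lab text */

/* Owner of element i when elements are dealt out round-robin. */
int owner_interleaved(size_t i, size_t n, int ncores)
{
    (void)n;
    return (int)(i % (size_t)ncores);
}

/* Owner of element i when each core gets one contiguous chunk. */
int owner_blocked(size_t i, size_t n, int ncores)
{
    size_t chunk = (n + (size_t)ncores - 1) / (size_t)ncores;  /* ceil(n / ncores) */
    return (int)(i / chunk);
}

/* Count cache lines of an n-element array (elem_bytes per element,
 * base address assumed line-aligned) that hold elements owned by
 * more than one core. */
size_t shared_lines(size_t n, size_t elem_bytes, int ncores,
                    int (*owner)(size_t, size_t, int))
{
    size_t per_line = LINE_BYTES / elem_bytes;
    size_t shared = 0;
    for (size_t start = 0; start < n; start += per_line) {
        int first = owner(start, n, ncores);
        for (size_t i = start + 1; i < start + per_line && i < n; i++) {
            if (owner(i, n, ncores) != first) {
                shared++;
                break;
            }
        }
    }
    return shared;
}
```

With n = 1000 eight-byte elements and two cores, shared_lines() reports that the interleaved split has both cores writing to all 125 lines. The contiguous split shares only the single line that straddles the chunk boundary. Under an invalidation-based protocol, each write to a line held by the other core forces a coherence transaction, so the number of shared lines is a rough proxy for coherence traffic.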

Optimizing Multi-Threaded Vector-Vector Add

Now that you know how to run benchmarks, gather performance results, and change the cache coherence protocol, you can optimize vvadd for the dual Rocket cores.

Write your code in the vvadd_opt() function found in ${LAB5ROOT}/lab/mt-vvadd-opt/mt-vvadd_opt.c and rebuild the vvadd_opt benchmark:

cd ${LAB5ROOT}/lab
(source ~cs152/sp26/cs152.lab5.bashrc; make) 

Then, run the benchmark on the simulator with both the MI and MSI configs using the commands below.

cd ${LAB5ROOT}/sims/verilator
make CONFIG=Lab5MIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-vvadd-opt.riscv LOADMEM=1
make CONFIG=Lab5MSIDualRocketConfig run-binary BINARY=${LAB5ROOT}/lab/mt-vvadd-opt.riscv LOADMEM=1

Submission

Use the following command to prepare your code for submission, and upload the resulting file to the Gradescope autograder. Follow the steps in Section 1.2 to download the .zip file locally for uploading to Gradescope.

cd ${LAB5ROOT}/lab
make zip-vvadd

Note that code outside of will be ignored.

There is no written section for the directed portion that you need to submit.

Vector-Vector Add Tips

Refer back to the Background Section for potential pitfalls with programming in this bare-metal environment. Remember that you can use printf() for debugging, with two caveats: floating-point values are not supported, and only one thread may call printf() at a time.
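Both printf() caveats can be worked around with a small helper. The following is a rough sketch, not part of the lab code: format_fixed3() and debug_print_value() are hypothetical names, the coreid parameter is assumed to identify the calling thread, and snprintf() is used here for a host-side illustration (whether the bare-metal library provides it is not stated in this handout). It prints a floating-point value using integer conversions only, and only from core 0.

```c
#include <stdio.h>

/* Format x with three fractional digits using only integer conversions.
 * Returns the number of characters written (as snprintf does). */
int format_fixed3(char *buf, size_t len, double x)
{
    /* Scale to thousandths and round to nearest. */
    long scaled = (long)(x * 1000.0 + (x < 0 ? -0.5 : 0.5));
    long ip = scaled / 1000;   /* integer part, carries the sign if nonzero */
    long fp = scaled % 1000;   /* fractional part in thousandths */
    if (fp < 0)
        fp = -fp;
    /* A value in (-1, 0) has ip == 0, so the sign must be added by hand. */
    const char *sign = (scaled < 0 && ip == 0) ? "-" : "";
    return snprintf(buf, len, "%s%ld.%03ld", sign, ip, fp);
}

/* Print a labeled value from core 0 only, so that two threads never
 * call printf() at the same time. */
void debug_print_value(int coreid, const char *label, double x)
{
    char buf[32];
    if (coreid != 0)
        return;
    format_fixed3(buf, sizeof buf, x);
    printf("%s = %s\n", label, buf);
}
```

Restricting output to a single core is the simplest way to satisfy the one-thread-at-a-time rule. If both cores need to print, some form of mutual exclusion around printf() would be needed instead.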

The benchmark prints the contents of the output_data and verify_data arrays if a mismatch is found.
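When the printed arrays are long, locating the first differing index by eye is tedious. As a minimal sketch, a helper like the one below could be used on arrays such as output_data and verify_data. The name first_mismatch() is hypothetical, and the double element type is an assumption; adjust it to match the benchmark's actual data type.

```c
#include <stddef.h>

/* Return the index of the first element where out and ref differ,
 * or n if the two arrays match. Element type is an assumption. */
size_t first_mismatch(size_t n, const double *out, const double *ref)
{
    for (size_t i = 0; i < n; i++) {
        if (out[i] != ref[i])
            return i;
    }
    return n;
}
```

Knowing the first bad index can indicate which core's portion of the work produced the wrong result.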

To see what each core is doing cycle by cycle, look at the trace in ${LAB5ROOT}/sims/verilator/output/chipyard.harness.TestHarness.*/mt-vvadd-opt.riscv.out, where * stands for the chosen CONFIG. The output from core 0 and core 1 is prefixed with C0: and C1:, respectively. The disassembly in ${LAB5ROOT}/lab/mt-vvadd-opt.riscv.dump may be useful for understanding the trace.

You can force make to repeat the simulation by manually removing the file from the simulator output directory.

Debugging with Spike:

If you encounter an error with incorrect outputs, you can first debug your code in the ISA-level simulator.

cd ${LAB5ROOT}/lab
make
spike -p2 mt-vvadd-opt.riscv

The -p option sets the number of simulated hardware threads.

Note that spike is not a cycle-accurate processor model, and the “performance” numbers can be distorted since the hardware threads do not execute concurrently (unlike our actual system) but are switched after some number of instructions. Also, this coarse-grained interleaving does not expose every race condition that would be possible on hardware.