Introduction and Goals

The goal of this laboratory assignment is to allow you to explore shared memory multi-processor systems using the Chisel simulation environment. You will be provided a complete implementation of a dual-core Rocket processor supporting the RV64GC ISA. You will write multi-threaded C programs to gain a better understanding of how data-level parallel (DLP) code maps to multi-core processors and to practice optimizing code for different cache coherence protocols.

While students are encouraged to discuss solutions to the lab assignments with each other, you must complete the directed portion of the lab yourself and submit your own work for these problems. For the open-ended portion of each lab, students can either work individually or in groups of two or three. Each group will turn in a single report for the open-ended portion of the lab. You are free to participate in different groups for different lab assignments.

Graded Items

All code and/or reports are to be submitted through Gradescope. The directed portion must be completed individually, and the open-ended portion can be completed in groups of 2-3 students.

  • Directed Portion: Optimized vvadd code.

    • You need to submit your code to the Gradescope autograder.
  • Open-ended Portion: Optimized matmul code

    • You need to submit your code to the Gradescope autograder.

    • If your performance is not satisfactory, you may optionally submit a written report for partial credit. See the Open-ended Section for more details.

Downloading Submission Files

To submit to the autograder, you will need to download the .zip files produced by the submission steps. You can use scp to move files between local and/or remote machines.

scp [OPTION] [[user@]SRC_HOST:]file1 [[user@]DEST_HOST:]file2

For example, to download the .zip files to your local Downloads folder, you can run the following commands.

scp <cs152-***>@eda-*.eecs.berkeley.edu:/<replace this with your actual path>/zip-vvadd.zip ~/Downloads
scp <cs152-***>@eda-*.eecs.berkeley.edu:/<replace this with your actual path>/zip-matmul.zip ~/Downloads

Background

Dual-Core Rocket Processor

Rocket will be returning from Lab 2, but this time, there are two Rocket cores.

Rocket is a 5-stage, single-issue, fully-bypassed, in-order RISC-V core. The configurations used in this lab implement the RV64IMAFDC instruction set variant1, which refers to the 64-bit RISC-V base ISA (RV64I) along with a set of useful extensions: M for integer multiply/divide, A for atomic memory operations, F and D for single- and double-precision floating-point, and C for 16-bit compressed representations of common instructions.

Rocket also supports the RISC-V privileged architecture with machine, supervisor, and user modes. It has an MMU that implements the Sv39 virtual memory scheme, which provides 39-bit virtual address spaces with 4 KiB pages. These processors are fully capable of booting mainstream operating systems such as Linux; however, no OS will be used in this lab, so code will run “bare metal” in M-mode.

Memory System

In this lab, you are provided with a dual-core system that utilizes a snoopy cache coherence protocol. Figure [fig:system] shows the high-level block diagram.

Each Rocket core has its own private L1 caches:

  • 16 KiB 4-way set-associative L1 instruction cache

  • 4 KiB 4-way set-associative L1 data cache

The data caches are kept coherent with one another.

An off-chip memory provides the last level of the memory hierarchy. Both cores are connected via a bus to main memory, which is backed by a DRAM model that simulates the functional and timing behaviors of a DDR3 memory system. Only one agent may access the bus at a time.

Conceptually, cache coherence is maintained by having caches broadcast their intentions across the bus and “snooping”, or monitoring, the actions of the other caches.

Multi-threaded Programming Environment

In most conventional multi-threaded programming environments, one thread begins execution at main(), which must then call some sort of spawn or clone() function to create more threads with assistance from the operating system.

In contrast, we will not be using an OS in this lab. Instead, all threads enter main() roughly simultaneously after a designated “boot” thread finishes initializing the C runtime environment. Each thread is provided with ncores (the number of cores in the system) and a coreid (a unique numerical identifier from 0 to ncores - 1, inclusive).

Memory Allocation

You will need to be careful how you allocate memory in your code. Local variables can be allocated on the stack as usual; however, each thread is reserved only a limited amount of stack space. You will want to use the static keyword to allocate variables statically in the executable image, where they are visible to all threads.

There is also the __thread storage class keyword, which denotes a thread-specific variable that should be located in thread-local storage (TLS). TLS is a mechanism by which each thread is given its own private instance of the variable. It requires significant orchestration between the linker and system libraries to work, but this complexity is largely transparent to user code.2

Synchronization Primitives

In the software framework, a barrier() function is provided to synchronize threads. Once a thread reaches the barrier() function, it waits until all threads in the system have reached the same barrier(). Implicit in the barrier is a memory fence. The barrier() function should be sufficient to implement any algorithm needed in this lab.

For more information on the RISC-V memory ordering instructions, consult Section 2.7 of the user-level ISA manual . Section 14 defines the RVWMO (RISC-V weak memory ordering) memory consistency model. Appendix A offers a more in-depth explanation of the rationale behind RVWMO.

The RISC-V fence instruction can be inserted in C code using the __sync_synchronize() GCC built-in function (saving you the hassle of inlining assembly). The GCC compiler provides more built-in functions for atomic memory accesses, such as __sync_fetch_and_add().3

The fence instruction behaves as follows: If the data cache is not busy, the fence immediately retires, and the pipeline continues execution. If the cache is busy servicing outstanding memory requests (i.e., cache misses), the fence stalls the pipeline until the cache is no longer busy. In this manner, the fence instruction ensures that all memory operations before the fence have completed before any memory operations after the fence are issued.4

Warnings and Pitfalls

  • The stack space provided to each thread is only 24 KiB. As there is no virtual memory protection, there will be no warning if you overrun your stack. Try to allocate arrays and other large data structures statically.

  • printf() can be used to debug your code. However, it is up to you to ensure that it is called by only one thread at a time; otherwise, the output may be incomprehensibly interleaved. Also, the printf implementation in this lab (provided by a stripped-down version of newlib, an embedded C library) does not support formatting floating-point types. You will have to cast them to integer types first. Note, however, that the randomly generated test vectors use whole numbers for convenience.

Acknowledgments

This lab was made possible through the hard work of Andrew Waterman and Henry Cook (among others) in developing the Rocket processor, memory system, cache coherence protocols, and multi-threading software environment. This lab was originally developed for CS152 at UC Berkeley by Christopher Celio.

  1. Also known as RV64GC, with G (“general-purpose”) being the canonical shorthand for “IMAFD”. 

  2. http://people.redhat.com/drepper/tls.pdf 

  3. https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html 

  4. Finer-grained fences can be performed by setting the predecessor and successor fields in the instruction, which define which types of accesses (memory reads, memory writes, device reads, device writes) should be ordered. 
