Open-ended Portion (80%)

Select one of the following questions per team.

All open-ended questions should use the parameters for BOOM as shown in Table 2, unless otherwise specified. These should already match the CS152SWPredBoomConfig, and CS152SmallBoomConfig configurations provided with the lab.

BOOM open-ended configurations
	Problem 4.1	Problem 4.2
Configuration	`CS152SWPredBoomConfig`	`CS152SmallBoomConfig`
Issue width	3	1
Fetch width	4	4
ROB	96 entries	32 entries
Issue slots	32 entries	8 entries
Integer register file	96 physical registers	52 physical registers
FP register file	64 physical registers	48 physical registers
LDQ/STQ	16 entries	8 entries
Max branches	12 branches	8 branches
L1D capacity	16 KiB
L1D associativity	4 ways
MSHRs	2 MSHRs

Designing Your Own Branch Predictor

The version of BOOM provided in the CS152SWPredBoomConfig uses a simple bimodal predictor consisting of a Branch History Table (BHT) of 256 2-bit saturating counters. This was the same scheme implemented by the MIPS R10000 , although with 512 entries instead. For this problem, your goal is to build an improved branch predictor for BOOM that performs better than this baseline.

Feel free to scour the abundance of literature and historical examples for ideas. This technical report provides a useful survey of common branch prediction techniques. A good design to look into is the two-level tournament branch predictor from the Alpha 21264 , which combines three predictors:

A global predictor with a 4096-entry table of 2-bit saturating counters indexed by the global history of the last 12 branches
A local predictor with a 1024-entry table of 10-bit branch patterns indexed by PC, which is then used to index a set of 3-bit counters
An arbiter (or “choice”) predictor with a 4096-entry table of 2-bit counters indexed by global history, used to select either the global or local prediction as the final one

Especially impressive designs could attempt to beat the hardware implementation of the TAGE predictor algorithm provided in the lab. You may replace the WithSWBPD config option with WithTAGELBPD in BoomConfigs.scala to compare a hardware implementation of the TAGE predictor algorithm, with a 2 KB fast PC-indexed BHT, and 8KB of TAGE tables.

The Chisel implementations of various predictor components, including TAGE, BTBs, BHTs, and the RAS are available in ${LAB3ROOT}/generators/boom/src/main/scala/ifu/bpd.

C++ Framework

For the implementation, we will leverage the C++ branch predictor framework that exists in BOOM, which lets one more easily construct software models of a branch predictor’s intended functional behavior for rapid prototyping and verification.

Although it presents a relatively simplified interface compared to a real hardware predictor module, it is flexible enough to admit a wide space of potential designs while abstracting away much of the complexity of the surrounding logic in the core.

Implement your branch predictor design in ${LAB3ROOT}/generators/boom/src/main/resources/csrc/predictor_sw.cc. This file contains three functions to modify:

Function	Description
`initialize_branch` ↪ `_predictor()`	This is called at the beginning of simulation and can be used to initialize global state.
`predict_branch()`	This is called to provide a prediction for a branch or jump instruction with the given PC (`ip`). The global history of the most recent branches is passed as a bit vector in `hist`. The `globalHistoryLength` config parameter defaults to 32 for the C++ predictor. The prediction is returned by assigning 1 for taken and 0 for not taken to the variable pointed to by `pred`.
`update_branch()`	This is called to update the predictor state with branch resolution information. Updates occur non-speculatively in program order. `ip` provides the PC of the branch or jump instruction, `hist` provides the global history, and `taken` provides the outcome of the branch or jump.

Both predict_branch() and update_branch() may be called multiple times per cycle. Remember that the branch predictor is inherently accessed speculatively, so you may see requests for branches that are later squashed by the pipeline.

Simulating

Build the simulator¹ and run the benchmark suite using the same flow as the directed portion (note the different top-level configuration):

cd ${LAB3ROOT}/sims/verilator
make CONFIG=CS152SWPredBoomConfig
make CONFIG=CS152SWPredBoomConfig run-bmark-tests

To run an individual benchmark only (e.g., dhrystone, you may need to run-bmark-tests first to set up symlinks):

make CONFIG=CS152SWPredBoomConfig run-binary LOADMEM=1 \
    BINARY=output/chipyard.harness.TestHarness.CS152SWPredBoomConfig/dhrystone.riscv

Debugging

To dump waveforms from simulation, run the debug versions of the make targets:

cd ${LAB3ROOT}/sims/verilator
make CONFIG=CS152SWPredBoomConfig debug
make CONFIG=CS152SWPredBoomConfig run-binary-debug LOADMEM=1 \
     BINARY=output/chipyard.harness.TestHarness.CS152SWPredBoomConfig/dhrystone.riscv 

Waveform dumps (which can become quite sizeable) are written to ${LAB3ROOT}/sims/verilator/output/chipyard.harness.TestHarness.CS152SWPredBoomConfig/*.vcd. The branch predictor is located at TestDriver.testHarness.chiptop0.system.tile_prci_domain.element_reset_domain_boom_tile.frontend.bpd.banked_predictors.pred_harness in the module hierarchy.

Waveforms can be viewed on the instructional servers with the DVE application or GTKWave, which requires X11 forwarding over ssh or X2Go:

dve &   !{\color{blue}\# `\&' backgrounds the process}!

Benchmarks

The source code for all benchmarks can be found in ${LAB3ROOT}/toolchains/riscv-tools/riscv-tests/benchmarks/. First initialize the riscv-tests submodule:

git submodule update --init ${LAB3ROOT}/toolchains/riscv-tools/riscv-tests

The disassemblies that correspond to the pre-installed binaries are available at ${RISCV}/riscv64-unknown-elf/share/riscv-tests/benchmarks/*.riscv.dump.

Submission

In your report, present the IPC and branch prediction accuracy results of your custom predictor on all benchmarks. Describe your design approach and any implementation challenges in detail, and explain its performance characteristics. Be sure to cite the appropriate sources if you borrowed from existing concepts.

How did you calculate the amount of state your branch predictor has?
How do certain parameters (e.g., number of entries) impact accuracy?
Which branches or patterns were easier or harder to predict?
What kind of application code do you expect your predictor to perform better or worse on?
Do you foresee any challenges with the implementation of your predictor algorithm as a hardware block within a superscalar, out-of-order core?
What changes or alternative approaches would you pursue as future work if more time were available?

Feel free to reach out to your GSI if you need help understanding BOOM, branch prediction schemes, or anything else regarding this problem.

Recreating Spectre Attacks

It turns out that BOOM, like many out-of-order processors, is susceptible to a class of microarchitectural side-channel attacks that exploit branch prediction, speculative execution, and cache timing to leak information from memory, bypassing security mechanisms such as virtual memory and bounds checks. These first came to prominance with the Spectre and Meltdown vulnerabilities disclosed in 2018. For this problem, your goal is to mount a Spectre attack on BOOM to extract secret data from protected kernel memory.

Background

First read the Spectre paper and the Google Project Zero post to understand the basic principles and techniques behind the Spectre exploit. The proof-of-concept in Appendix C of the paper may be a useful reference as you write your code.

Although not the focus of this problem, it is also worth reading a little about Meltdown , a closely related variant that arises from deferred TLB permissions checks.

Attack Scenario

In this scenario, you control a malicious adversary that runs as an unprivileged program in user mode (U-mode) on top of the RISC-V proxy kernel (pk), a lightweight execution environment that assumes the role of a minimalistic operating system. pk is designed to support tethered RISC-V implementations with limited I/O capability and thus handles I/O-related system calls by proxying (forwarding) them to a host – in this case, the machine running the simulation.

pk itself runs in the higher-privilege supervisor mode (S-mode) but shares the same virtual address space as the user program. The lower 2 GiB of the virtual address space is reserved for the user program, while pk is mapped into the upper portion after 0x80000000.² This is a common technique in operation systems to facilitate more efficient communication between kernel and user space. For our purposes, this fact merely makes it slightly easier to reason about the addresses of the secret data and the privileged code being targeted.³

The objective of the adversary is to learn the values of a contiguous 128-byte array in the static .data.secret section in pk, representing a secret key. This data is ordinarily inaccessible to user programs, as it resides in a page with supervisor-only read permissions.

We deliberately introduce a few artificial conditions to simplify the task without compromising the fundamental methodology:

The secret data is placed at a fixed virtual address that is already known.
The attack vector is a custom syscall handler in pk that contains a vulnerable Spectre gadget by design.
The Spectre gadget leaks only one bit at a time to minimize noise from cache thrashing.

The Spectre gadget, shown here lightly edited, is very similar to the canonical example for Spectre Variant 1 (bounds check bypass):

uint8_t leak_array[128]; // Two 64-byte cache lines

int sys_leak(size_t index, int shift)
{
    if (index < sizeof(leak_array)) {
        uint8_t data = leak_array[index];
        index = (data >> shift) & 0x1;
        return leak_array[index * 64];
    }
    return -1;
}

The first array access, which involves an index controlled by the adversary through a syscall argument, can be used to speculatively read an arbitrary byte in memory. The second array access then reads from different cache lines depending on the data value and the shift argument.

Note that the bounds check in the if statement prevents the results of invalid accesses from becoming architecturally visible. However, side effects in the cache induced by the data-dependent load persist after speculative execution, and these side effects can be measured to infer data values. While several approaches to cache-based covert channels exist, a Prime+Probe attack is likely the most practical option given that BOOM does not presently implement an unprivileged cache flush instruction.⁴

The overall flow of the attack would proceed as follows:

Training step: Mistrain the branch predictor to strongly predict that the if condition for the bounds check will be true, generally by repeatedly invoking the syscall with valid inputs.
Prime step: Fill the cache sets that leak_array would map to with known lines.
Exploit step: Invoke the syscall with malicious arguments. After the if condition is predicted false, the if body is speculatively executed. The inputs are specially crafted so that the out-of-bounds load points to a secret byte, of which one bit is used to construct the pointer for another load that evicts one of the primed lines. The misspeculated execution path is eventually reverted when the branch is resolved, but the cache state remains perturbed.
Probe step: Determine which line was evicted by measuring the access time to each line. The cache set that previously held the evicted line corresponds to the value of the leaked bit.
Repeat for enough samples to gain sufficiently high confidence.

It is not necessary to follow this procedure exactly. You may want to experiment with variations to improve accuracy and/or throughput.

Reverse Engineering

The source code for the sys_leak handler can be found in ${LAB3ROOT}/toolchains/riscv-tools/riscv-pk/pk/syscall.c. First initialize the riscv-pk submodule:

git submodule update --init ${LAB3ROOT}/toolchains/riscv-tools/riscv-pk

A pre-built pk-spectre executable is already provided along with the Lab 3 infrastructure. To figure out the virtual addresses of various symbols in pk (e.g., leak_array), refer to the symbol table in the fully linked binary:

riscv64-unknown-elf-readelf -s ${LAB3ROOT}/lab/open2/pk-spectre > pk.sym

To figure out the instruction PCs to compare to the BOOM traces, look at the disassembly:

riscv64-unknown-elf-objdump -d ${LAB3ROOT}/lab/open2/pk-spectre > pk.dump

To figure out the addresses of the secret data and to verify your results, dump the .data.secret section contents:

riscv64-unknown-elf-objdump -s -j .data.secret ${LAB3ROOT}/lab/open2/pk-spectre

Data words are displayed in big-endian byte order, i.e., the most significant (leftmost) byte in a word is stored at the lowest address. (The test data comes from the digits of $\pi$.)

Development and Testing

To help you start writing your code, ${LAB3ROOT}/lab/open2/spectre.c contains an example skeleton with helper functions to invoke the syscall and read the cycle counter. To compile:

cd ${LAB3ROOT}/lab/open2
make

Feel free to modify the user program however you wish: add a custom linker script, mix in assembly files, etc.

To run the simulation:⁵

cd ${LAB3ROOT}/sims/verilator
make CONFIG=CS152SmallBoomConfig run-spectre

Traces/logs go to and , respectively.

To enable logging of dispatched instructions in BOOM, open ${LAB3ROOT}/generators/boom/src/main/scala/common/config-mixins.scala and set enableDispatchPrintf to true under the WithNCS152SmallBooms config. ⁶ This should help give you sense of the size of the speculative execution window prior to the branch mispredict recovery.

To quickly test cache eviction/measurement code and other utility functions, you can write a bare-metal program in ${LAB3ROOT}/lab/open2/baremetal.c. Simulations are much shorter as it avoids waiting for pk to initialize and load the user program. This code executes in machine mode (M-mode) with physical addressing, and only a limited subset of libc is supported. Note: It is not possible to perform the entire attack in a bare-metal environment, as the victim syscall and secret data are not available without pk.

make CONFIG=CS152SmallBoomConfig run-binary-hex BINARY="${LAB3ROOT}/lab/open2/baremetal.riscv" 

Submission

Your Spectre attack code should be submitted through Gradescope. In your report, explain how your code works, and describe any challenges encountered and solutions that you used.

What is the accuracy and performance (cycles per secret byte) of your attack?
What changes could you adopt to increase the speed (in cycles per secret byte) of your attack? What are the tradeoffs or disadvantages of such techniques?
What are some hardware and/or software countermeasures against Spectre that you can think of? Discuss the advantages and disadvantages of each.

Feel free to reach out to your GSI if you need help understanding Meltdown/Spectre, BOOM, riscv-pk, or anything else regarding this problem.

Feedback Portion

In order to improve the labs for the next offering of this course, we would like your feedback. Please append your feedback to your individual report for the directed portion. These questions are also placed at the bottom of the open-ended part of the lab for your convenience.

How many hours did you spend on the directed and open-ended portions?
What did you dislike most about the lab?
What did you like most about the lab?
Is there anything that you would change?
Is there something else you would like to explore in the open-ended portion?
Are you interested in modifying hardware designs as part of the lab?

Feel free to write as much or as little as you prefer (a point will be deducted only if left completely empty).

Team Feedback

In addition to feedback on the lab itself, please answer a few questions about your team:

In one short paragraph, describe your contributions to the project.
Describe the contribution of each of your team members.
Do you think that every member of the team contributed fairly? If not, why?

The build system has been slightly modified for this lab to avoid re-elaborating Chisel if only C++ resource files have been changed. ↩
This offset is specifically chosen such that this virtual address range is identical to the physical address range where pk resides, i.e., VA = PA. ↩
Meltdown can be mitigated in hardware by not forwarding faulting loads or in software by separating the user and kernel address spaces, accomplished through kernel page-table isolation (KPTI) in Linux. Neither of these, however, defend against Spectre. ↩
Flush+Reload is theoretically possible since the L2 cache controller provides a mechanism to flush specific lines, which also evicts from the L1 due to the inclusive property, but the requisite MMIO control registers are not exposed to U-mode by default. ↩
The 1-wide BOOM configuration is preferred as it simulates much more quickly. ↩
You may also want to disable enableBranchPrintf to make the log more readable ↩