Cache Example and Solution

- Processor Characteristics
- Freq = 200 MHz
- Superscalar capable of 4 IPC
- Achieved in absence of memory delays is 2 IPC
- Capable of 2 load/store per clock
- Average ref/inst = 0.2 read + 0.1 write
- Data Cache
- Tc = 5 ns, Ta = 10 ns
- dual-ported
- 64B lines
- Miss rate = 1%
- Write-Back, Write Allocate
- 50% replaced lines are dirty
- Memory
- Ta=100ns, Tc=50 ns
- 4-way interleave on word boundary
- Bus
- 8B wide, 100 MHz clock

The basic timing for a cache line access of 8 words follows the analysis in 6.8.2: The total line access time, including address and data transfers over the bus is 200 ns. This is 40 clock cycles at the 200 MHz rate of the processor.

Simpler Design

We assume the dirty line is written back first, then the missing line is read before the processor can proceed.

The average miss penalty requires us to write back a dirty line 50% of the time, then read the missing line. So this is an average of 60 cycles (0.5*40 + 1*40).

The number of data references (reads + writes) per instruction is 0.3. The number of misses per instruction with 1% miss rate is 0.003.

So the average miss delay per instruction is .003 miss/inst * 60 cycles/miss = 0.18 CPI.

The average execution time per instruction including memory delay = 0.68 CPI

Higher Performance Design

We assume the missing line is read with critical word first, then the processor is allowed to proceed. The memory is busy for the rest of the line read and, when the replaced line is dirty, for the line write-back.

From the timing diagram above we see that the first (critical) word arrives back on the bus after 120 ns, which is 24 processor clocks. So T_{cache} = 24 clocks and T_{busy} = 38 clocks. (Tbusy is the remaining time of 18 clocks for the line read and an average of 20 clocks for write back.)

The probability of experiencing a second cache miss during T_{busy} is approximately

0.006*38 = 0.228 since there are .006 misses per clock in the absence of memory delays. (That is calculated from 2 IPC and 0.003 miss/inst.) The average penalty for the procesosr to wait for a miss during T_{busy} is approximately 19 clocks (just half of the 38 total).

So the average miss penalty per instruction = .003 miss/inst * (24 + .228*19) clock/miss = .085.

The average execution time per instruction including memory delay = 0..585 CPI

This is 16% better performance than the simpler design evaluated above.

Note: There are many assumptions built into the analysis that are certainly inaccurate. But the method is still useful as a BOE ("Back Of the Envelope") approach to guide the design process before refining with more detailed simulation.

- assuming resubmissions of denied requests
- B(r,n) = nr
_{a }solve for r_{a}iteratively - nr
_{a }= 1- (1-a)^{n }where a is

a = r/(r + (r_{a}/r)(1 -r)) initially let a= r - Example
- Two processors share a bus. Each has 30% bus utilization in the absence of contention from another processor (i.e., uniprocessor)
- r = 0.3, n = 2

So the solution converges with achieved utilization of 27% vs. 30% with no contenction. That is, each processor slows down 10% due to bus contention with the other processor.