Cache Example and Solution
The basic timing for a cache line access of 8 words follows the analysis in 6.8.2: the total line access time, including address and data transfers over the bus, is 200 ns. This is 40 clock cycles at the 200 MHz rate of the processor (5 ns per clock).
We assume the dirty line is written back first, then the missing line is read before the processor can proceed.
The average miss penalty requires us to write back a dirty line 50% of the time, then read the missing line. So this is an average of 60 cycles (0.5*40 + 1*40).
The number of data references (reads + writes) per instruction is 0.3. The number of misses per instruction with 1% miss rate is 0.003.
So the average miss delay per instruction is 0.003 miss/inst * 60 cycles/miss = 0.18 CPI.
The average execution time per instruction including memory delay = 0.5 + 0.18 = 0.68 CPI, where 0.5 CPI is the base execution rate of 2 instructions per clock with no memory delays.
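The arithmetic above can be checked with a short back-of-the-envelope script. The variable names are illustrative; the numbers are those given in the text.

```python
# BOE CPI estimate for the simple (blocking) cache design.
LINE_ACCESS_NS = 200           # full 8-word line transfer over the bus
CLOCK_NS = 1000 / 200          # 200 MHz processor -> 5 ns per clock

line_cycles = LINE_ACCESS_NS / CLOCK_NS              # 40 clocks per line
dirty_fraction = 0.5                                 # half of misses write back first
miss_penalty = dirty_fraction * line_cycles + line_cycles   # 60 clocks average

refs_per_inst = 0.3            # data references (reads + writes) per instruction
miss_rate = 0.01               # 1% of data references miss
misses_per_inst = refs_per_inst * miss_rate          # 0.003 misses/inst

base_cpi = 0.5                 # 2 instructions per clock with no memory delays
cpi = base_cpi + misses_per_inst * miss_penalty      # 0.68 CPI
print(line_cycles, miss_penalty, cpi)
```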
Higher Performance Design
We assume the missing line is read with critical word first, then the processor is allowed to proceed. The memory is busy for the rest of the line read and, when the replaced line is dirty, for the line write-back.
From the timing diagram above we see that the first (critical) word arrives back on the bus after 120 ns, which is 24 processor clocks. So Tcache = 24 clocks and Tbusy = 38 clocks. (Tbusy is the remaining time of 18 clocks for the line read and an average of 20 clocks for write back.)
The probability of experiencing a second cache miss during Tbusy is approximately 0.006*38 = 0.228, since there are 0.006 misses per clock in the absence of memory delays (calculated from 2 IPC and 0.003 miss/inst). The average penalty for the processor to wait for a miss during Tbusy is approximately 19 clocks (half of the 38 total).
So the average miss penalty per instruction = 0.003 miss/inst * (24 + 0.228*19) clock/miss = 0.085 CPI.
The average execution time per instruction including memory delay = 0.5 + 0.085 = 0.585 CPI.
This is 16% better performance than the simpler design evaluated above.
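The critical-word-first estimate can be checked the same way; again the names are illustrative and the inputs come straight from the text.

```python
# BOE CPI estimate for the critical-word-first design.
base_cpi = 0.5                 # 2 IPC with no memory delays
misses_per_inst = 0.003        # 0.3 refs/inst * 1% miss rate

t_cache = 24                   # clocks until the critical word arrives (120 ns)
t_busy = 38                    # 18 remaining read clocks + 20 avg write-back clocks

misses_per_clock = 2 * misses_per_inst        # 2 IPC * 0.003 miss/inst = 0.006
p_second_miss = misses_per_clock * t_busy     # ~0.228 chance of a miss during Tbusy
avg_wait = t_busy / 2                         # 19 clocks average wait

penalty_cpi = misses_per_inst * (t_cache + p_second_miss * avg_wait)  # ~0.085
cpi = base_cpi + penalty_cpi                  # ~0.585 CPI
speedup = 0.68 / cpi - 1                      # ~16% over the simple design
print(round(penalty_cpi, 3), round(cpi, 3), round(speedup * 100))
```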
Note: There are many assumptions built into the analysis that are certainly inaccurate. But the method is still useful as a BOE ("Back Of the Envelope") approach to guide the design process before refining with more detailed simulation.
So the solution converges with an achieved utilization of 27% vs. 30% with no contention. That is, each processor slows down 10% due to bus contention with the other processor.