EE382: Processor Design

Mid-Term Examination

February 10, 1998

Please do not open the exam book or begin work on the exam until instructed to do so.

You have a total of 2 hours to complete this exam. You will be informed when 2 hours have elapsed. You must stop all work on the exam at that time. You may use your textbook and notes during the exam, as well as a calculator. Show work and report your answers on each sheet. Use the blank sheet at the end of the exam, the back of the page, or attach additional sheets if necessary. Good Luck!

Your matriculation at Stanford University indicates that you have read and understood the Honor Code, and you agree to abide by the Code. Your signature here confirms that.

Signed: _________________________________

Name (Printed): __________________________

Stanford ID: ______________________________















SITN Students: Please attach routing slip.





Problem 1: Pipelining [25 points]

Table 1 lists the logic segments and associated combinational delays used to construct a pipeline. Additional factors to use in analyzing the pipeline timing are:

C: clocking overhead including fixed-skew is 1 ns

k: variable skew stretch factor is 0

b: frequency of pipeline breaks is 0.1


Segment    Min Delay (ns)    Max Delay (ns)
[table entries not recoverable]

Table 1

    1. Assume there is no restriction on the placement of pipeline latches. Determine the optimal pipeline performance with conventional (not wave-pipelined) clocking, and report the following results. Remember that the pipeline must have an integral number of stages.

      (a) No. pipeline stages = __________ [4 points]

      (b) Cycle time = ____________ [4 points]

      (c) Performance in MIPS = ___________ [4 points]
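As a rough aid for checking part 1, the sketch below searches integer stage counts for the best throughput under one common model: cycle time T/S + C (with k = 0) and a cost of (S - 1) extra cycles on each pipeline break. The model, the function name `best_pipeline`, and the 16 ns total delay in the usage lines are assumptions for illustration only; they are not taken from Table 1.

```python
def best_pipeline(total_delay_ns, overhead_ns, break_freq, max_stages=64):
    """Return (stages, cycle_ns, mips) with the highest throughput."""
    best = None
    for stages in range(1, max_stages + 1):
        cycle_ns = total_delay_ns / stages + overhead_ns
        # Assumed break model: each break costs (stages - 1) extra cycles.
        ns_per_instr = (1 + (stages - 1) * break_freq) * cycle_ns
        mips = 1000.0 / ns_per_instr
        if best is None or mips > best[2]:
            best = (stages, cycle_ns, mips)
    return best


# Hypothetical total delay of 16 ns; C = 1 ns and b = 0.1 come from the problem.
stages, cycle_ns, mips = best_pipeline(16.0, 1.0, 0.1)
print(stages, round(cycle_ns, 3), round(mips, 1))
```

Searching integers directly avoids rounding a continuous optimum the wrong way, since the best integer stage count is not always the nearest one.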

    2. Now determine the performance using wave pipelining and report the following results.
      Assume the following:
      (i) pipeline breaks still occur with frequency b and
      (ii) when an instruction that causes a break is executed, that instruction passes through the complete pipeline and then the next instruction enters the pipeline at the following clock boundary.

      (a) Cycle time = ____________ [4 points]

      (b) What are the constructive clock skews at the end of each pipeline segment? [5 points]
      Segment 1 _________
      Segment 2 _________
      Segment 3 _________
      Segment 4 _________

      (c) Performance in MIPS = ___________ [4 points]


Problem 2: Cache Cost and Performance Models [25 points]

This problem addresses tradeoffs for deciding whether a microprocessor should use an L2 cache.

The microprocessor includes an integer pipeline that achieves 1.5 CPI in the absence of cache delays. The entire microprocessor occupies 200 mm^2, including integer core, FPU, MMU, L1 caches, I/O pads, and associated overhead. The technology is 0.5 µm with a defect density of 1/cm^2 and a cost of $5000 for a 21 cm wafer. The frequency is 150 MHz.

The microprocessor includes separate I and D caches that are each 8KB (8192 bytes) and 2-way associative with 32B lines. Caches are blocking. The number of memory references per instruction is on average 1.0 IF, 0.2 DR, and 0.1 DW. The miss delay going to memory is 20 clocks. The data cache is write-back, write-allocate. Assume that there is sufficient buffering so writing dirty lines back introduces no delay. The environment is multi-programmed with MP=3 and Q=20,000.

2.1 What is the cost per good part of the microprocessor? $___________ [6 points]

(Remember that the gross dice per wafer should be rounded to an integer, but the average good dice per wafer should not.)
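The arithmetic for 2.1 can be sketched as follows. The gross-dice formula (wafer area over die area with an edge-loss correction) and the Poisson yield model are assumptions here; use whichever forms the course text specifies. The function name `cost_per_good_die` is hypothetical.

```python
import math


def cost_per_good_die(wafer_cost, wafer_diam_cm, die_area_cm2, defects_per_cm2):
    """Return (gross dice, average good dice, cost per good die)."""
    # Assumed gross-dice formula with an edge-loss correction, rounded down.
    gross = math.floor(
        math.pi / (4 * die_area_cm2)
        * (wafer_diam_cm - math.sqrt(die_area_cm2)) ** 2
    )
    # Assumed Poisson yield model: probability that a die has zero defects.
    die_yield = math.exp(-defects_per_cm2 * die_area_cm2)
    good = gross * die_yield  # average good dice per wafer, left unrounded
    return gross, good, wafer_cost / good


# 200 mm^2 = 2 cm^2, 21 cm wafer, 1 defect/cm^2, $5000 per wafer.
gross, good, cost = cost_per_good_die(5000.0, 21.0, 2.0, 1.0)
print(gross, round(good, 2), round(cost, 2))
```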


2.2 What is the performance in MIPS of the processor? ___________________ [7 points]
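A minimal sketch of the usual blocking-cache performance model for 2.2: stall cycles per instruction are the sum over reference types of references per instruction, times miss rate, times miss delay. The miss rates below are placeholders, not the course's values; the real ones must come from the course's miss-rate data, adjusted for the multiprogrammed environment. The function name `mips_with_l1` is hypothetical.

```python
def mips_with_l1(freq_mhz, base_cpi, refs_per_instr, miss_rates, miss_delay):
    """MIPS for a blocking cache; the dict arguments are keyed by reference type."""
    stall_cpi = sum(
        refs_per_instr[kind] * miss_rates[kind] * miss_delay
        for kind in refs_per_instr
    )
    return freq_mhz / (base_cpi + stall_cpi)


refs = {"IF": 1.0, "DR": 0.2, "DW": 0.1}           # from the problem statement
placeholder_rates = {"IF": 0.02, "DR": 0.05, "DW": 0.05}  # NOT course values
print(mips_with_l1(150.0, 1.5, refs, placeholder_rates, 20))
```

Data writes are counted as misses that stall the pipeline because the data cache is write-allocate; dirty write-backs are excluded because the problem says they introduce no delay.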




Now we consider adding an L2 cache with everything else about the microprocessor described above remaining the same. The L2-cache tags and controller will be integrated with the microprocessor and the L2 data will be stored in separate SRAMs. The L2 cache combines instructions and data with a total capacity of 256KB (262144 bytes) using 32B lines with 4-way associative organization. The size of physical addresses is 32 bits. The L2 cache is write-back, write-allocate. Assume there is sufficient buffering so writing lines back introduces no delay.

2.3 The area in rbe of the L2-cache tags and controller is given by 195 + 0.6*1.2*tag_bits. Assume that this function can be added to the microprocessor with no additional area required for overhead or waste.

What is the area occupied by the L2-cache tags and controller in sq. mm? ____________
[6 points]
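The tag-bit count that feeds the area formula in 2.3 can be sketched as below. It counts address-tag bits only; whether valid, dirty, or replacement bits should be included is an assumption left open here. The conversion from rbe to sq. mm depends on the course's rbe size for the given technology and is deliberately left out. The function name `l2_tag_area_rbe` is hypothetical.

```python
def l2_tag_area_rbe(capacity_bytes, line_bytes, ways, addr_bits):
    """Return (tag bits per line, total tag bits, area in rbe)."""
    lines = capacity_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1  # log2, valid for powers of two
    index_bits = sets.bit_length() - 1         # log2, valid for powers of two
    tag_bits_per_line = addr_bits - index_bits - offset_bits
    tag_bits = lines * tag_bits_per_line       # address tags only, no status bits
    area_rbe = 195 + 0.6 * 1.2 * tag_bits      # formula given in 2.3
    return tag_bits_per_line, tag_bits, area_rbe


per_line, total_bits, area_rbe = l2_tag_area_rbe(262144, 32, 4, 32)
print(per_line, total_bits, round(area_rbe, 2))
```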


2.4 The delay for a miss in L1 with hit in L2 is 4 clocks. The additional delay for a miss in L2 going to memory is 20 clocks. (The total miss delay for miss in L1 and L2 is 24 clocks.)

What is the performance in MIPS of the processor with the L2 cache? _______________
[6 points]

Problem 3: Mean Buffer Performance Modeling [25 points]

A processor executes 1 IPC at 100 MHz. On average there are 0.1 data writes per instruction. The writes can be fully buffered.

The memory system has 4 modules and a cycle time of 100ns.

3.1 What is the mean number of writes buffered within the processor? ____________ [10 points]
State which queuing model you use and why.
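The sketch below computes the per-module utilization for 3.1 and the mean number of waiting requests under two candidate open-queue models, M/D/1 and M/M/1. It assumes writes are spread uniformly over the modules and counts only requests waiting, not the one in service; choosing and justifying the model is still up to you. The function name `write_queue_means` is hypothetical.

```python
def write_queue_means(ipc, freq_mhz, writes_per_instr, modules, service_ns):
    """Return (rho per module, total waiting under M/D/1, total under M/M/1)."""
    writes_per_ns = ipc * freq_mhz * 1e6 * writes_per_instr / 1e9
    # Assumes writes are distributed uniformly across the modules.
    rho = writes_per_ns / modules * service_ns
    md1_waiting = rho ** 2 / (2 * (1 - rho))  # constant service time
    mm1_waiting = rho ** 2 / (1 - rho)        # exponential service time
    return rho, modules * md1_waiting, modules * mm1_waiting


rho, md1_total, mm1_total = write_queue_means(1.0, 100.0, 0.1, 4, 100.0)
print(round(rho, 3), round(md1_total, 4), round(mm1_total, 4))
```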


3.2 The processor has a single write-buffer that combines write queues for all memory modules. You must choose the size of the write buffer to ensure it is full on less than 1% of writes. Use the Markov and Chebyshev inequalities to determine a conservative estimate for the buffer size.

Hints:
1. Table 6.3 on p. 375 provides the variance of queue length.
2. Recall that when two random variables, X1 and X2, are independent, then
Var(X1 + X2) = Var(X1) + Var(X2)

What is the buffer size? ______________ [15 points]
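Both inequalities in 3.2 turn a mean and a variance into a conservative buffer size. The sketch below is a minimal illustration: Markov gives P(X >= a) <= mean / a, and Chebyshev gives P(|X - mean| >= t) <= variance / t^2. The function name `buffer_bounds` and the numbers in the usage lines are hypothetical; the real mean and variance come from your queue model and the table cited in the hints.

```python
import math


def buffer_bounds(mean, variance, overflow_prob):
    """Return (Markov size, Chebyshev size) for the given overflow probability."""
    # Markov: mean / a <= p  =>  a >= mean / p.
    markov = mean / overflow_prob
    # Chebyshev: variance / t**2 <= p  =>  t >= sqrt(variance / p),
    # so the buffer must reach mean + t.
    chebyshev = mean + math.sqrt(variance / overflow_prob)
    return markov, chebyshev


# Hypothetical mean and variance, with the 1% target from the problem.
markov, chebyshev = buffer_bounds(1.0, 1.0, 0.01)
print(math.ceil(markov), math.ceil(chebyshev))
```

Either bound alone is sufficient, so the smaller of the two, rounded up to a whole entry, is still a conservative size.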



Problem 4: Memory System Performance Modeling [25 points]

This problem addresses memory contention among cache misses in a multiprocessor system. The system consists of 4 processors and 8 memory modules. The memory has a cycle time of 100ns.

Each processor includes a single pipeline that operates at 200 MHz and is capable of executing 100 MIPS in the absence of memory contention (but including pipeline delays and cache miss delays). The number of memory references per instruction is 1.0 IF, 0.3 DR, 0.1 DW.

The processor includes separate instruction and data caches. The instruction cache miss rate is 5%, and the data cache miss rate is 10%. Each of the caches is blocking and prefetching is not used. The data cache is write-back and write-allocate. Assume that the write-back of dirty lines is buffered and serviced at lower priority than reads by any processor, so they cause no delay.

4.1 Determine the following values. State which queuing model you use and why.

(a) d = __________ [5 points]

(b) offered bandwidth in MAPS = _________________ [5 points]
Note: MAPS refers to Millions of cache-line Accesses Per Second

(c) achieved bandwidth in MAPS = _______________ [10 points]

(d) average performance of each CPU in MIPS = __________ [5 points]
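The offered bandwidth in part (b) follows from the miss traffic the processors would generate if there were no contention. The sketch below assumes that read and write misses each fetch one line (the data cache is write-allocate) and that dirty write-backs are excluded because they cause no delay. Achieved bandwidth depends on the contention model you choose and is not computed here. The function name `offered_maps` is hypothetical.

```python
def offered_maps(processors, mips_each, refs_per_instr, miss_rates):
    """Offered cache-line accesses, in millions per second, across all processors."""
    misses_per_instr = sum(
        refs_per_instr[kind] * miss_rates[kind] for kind in refs_per_instr
    )
    return processors * mips_each * misses_per_instr


refs = {"IF": 1.0, "DR": 0.3, "DW": 0.1}      # references per instruction
rates = {"IF": 0.05, "DR": 0.10, "DW": 0.10}  # miss rates from the problem
print(round(offered_maps(4, 100.0, refs, rates), 2))
```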



4.2 If 50% of lines replaced from the data cache are dirty, the utilization of a memory module = ______.
[5 points]