
University of California, Berkeley
College of Engineering
Computer Science Division (EECS)

Fall 1999                                             John Kubiatowicz

Midterm II

SOLUTIONS

November 17, 1999

CS152 Computer Architecture and Engineering

Your Name:

SID Number:

Discussion Section:

Problem   Possible   Score
1         25
2         25
3         25
4         25
Total     100

[ This page left for π ]

[ This page left for scratch ]

Problem 1d: Below is a series of memory read references sent to the cache from part (a). Assume that the cache is initially empty, and classify each memory reference as a hit or a miss. Identify each miss as either compulsory, conflict, or capacity. One example is shown. Hint: start by splitting each address into its components. Show your work.

Address   Hit/Miss?   Miss Type?
0x300     Miss        Compulsory
0x1BC     Miss        Compulsory
0x206     Miss        Compulsory
0x109     Miss        Compulsory
0x308     Miss        Conflict
0x1A1     Miss        Compulsory
0x1B1     Hit
0x2AE     Miss        Compulsory
0x3B2     Miss        Compulsory
0x10C     Hit
0x205     Miss        Conflict
0x301     Miss        Conflict
0x3AE     Miss        Compulsory
0x1A8     Miss        Conflict
0x3A1     Hit
0x1BA     Hit

Problem 1e: Calculate the miss rate and hit rate.

Hit Rate = 4/16 = 0.25

Miss Rate = 1 – Hit Rate = 12/16 = 0.75
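These rates can be checked mechanically by counting the table's outcomes; a minimal Python sketch, with the hit/miss column transcribed by hand:

```python
# Hit ("H") / miss ("M") outcomes from the table in Problem 1d, in order.
outcomes = ["M", "M", "M", "M", "M", "M", "H", "M", "M", "H",
            "M", "M", "M", "M", "H", "H"]

hit_rate = outcomes.count("H") / len(outcomes)
miss_rate = 1 - hit_rate
print(hit_rate, miss_rate)  # 0.25 0.75
```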

Problem 1f: You have a 500 MHz processor with 2 levels of cache, 1 level of DRAM, and a DISK for virtual memory. Assume that it has a Harvard architecture (separate instruction and data caches at level 1). Assume that the memory system has the following parameters:

Component            Hit Time                      Miss Rate                  Block Size
First-Level Cache    1 cycle                       4% Data, 1% Instructions   64 bytes
Second-Level Cache   20 cycles + 1 cycle/64 bits   2%                         128 bytes
DRAM                 100ns + 25ns/8 bytes          1%                         16K bytes
DISK                 50ms + 20ns/byte              0%                         16K bytes

Finally, assume that there is a TLB that misses 0.1% of the time on data (doesn’t miss on instructions) and which has a fill penalty of 40 cycles. What is the average memory access time (AMAT) for Instructions? For Data (assume all reads)?

AMAT_DISK = (5 × 10^7 ns) + (16384 × 20 ns) = 50,327,680 ns ÷ 2 ns/cycle = 25,163,840 cycles
AMAT_DRAM = (100ns + 25ns × 16) + 0.01 × AMAT_DISK = (500ns + 503,276.8ns) ÷ 2 ns/cycle = 251,888.4 cycles
AMAT_L2 = (20 + 8) + 0.02 × AMAT_DRAM = 5065.77 cycles
AMAT_INST = 1 + 0.01 × AMAT_L2 = 51.66 cycles
AMAT_DATA = 1 + 0.04 × AMAT_L2 + 0.001 × 40 = 203.67 cycles
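The chain can be cross-checked numerically; a minimal Python sketch, assuming the 2 ns cycle time of the 500 MHz clock and the block sizes from the table:

```python
NS_PER_CYCLE = 2  # 500 MHz clock

# Disk: 50 ms seek + one 16 KB page at 20 ns/byte, converted to cycles.
amat_disk = (50e6 + 16384 * 20) / NS_PER_CYCLE
# DRAM: 100 ns + one 128-byte L2 block at 25 ns per 8 bytes, plus disk misses.
amat_dram = (100 + 25 * (128 // 8)) / NS_PER_CYCLE + 0.01 * amat_disk
# L2: 20 cycles + 1 cycle per 64 bits of a 64-byte L1 block, plus DRAM misses.
amat_l2 = (20 + 64 * 8 // 64) + 0.02 * amat_dram
amat_inst = 1 + 0.01 * amat_l2
amat_data = 1 + 0.04 * amat_l2 + 0.001 * 40  # data TLB: 0.1% miss, 40-cycle fill

print(round(amat_inst, 2), round(amat_data, 2))  # 51.66 203.67
```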

Why are these so high? Because the miss rate from DRAM to disk (1%) is so high. This computer would technically be “thrashing”, i.e. spending all of its time moving pages to and from the disk.

Problem 1g: Suppose that we measure the following instruction mix for benchmark “X”: Loads: 20%, Stores: 15%, Integer: 30%, Floating-Point: 15%, Branches: 20%. Assume that we have a single-issue processor with a minimum CPI of 1.0. Assume that we have a branch predictor that is correct 95% of the time, and that an incorrect prediction costs 3 cycles. Finally, assume that data hazards cause an average penalty of 0.7 cycles for floating-point operations. Integer operations run at maximum throughput. What is the average CPI of benchmark X, including memory misses (from part f)?

CPI = CPI_NORMAL + CPI_Compute-stalls + CPI_Memory-stalls
    = 1 + (0.3 × 0 + 0.15 × 0.7 + 0.2 × 0.05 × 3) + (AMAT_INST + 0.35 × AMAT_DATA)
    = 1 + 0.135 + (51.66 + 0.35 × 203.67) = 124.08 cycles/instruction
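A quick numeric check of the same arithmetic, carrying the AMAT values over from part (f):

```python
amat_inst, amat_data = 51.66, 203.67  # from Problem 1f

# FP hazards (15% of mix, 0.7 cycles) + branch mispredicts (20% × 5% × 3 cycles).
compute_stalls = 0.30 * 0 + 0.15 * 0.7 + 0.20 * 0.05 * 3
# Every instruction is fetched; 0.35 = loads (20%) + stores (15%) touch data.
memory_stalls = amat_inst + 0.35 * amat_data

cpi = 1 + compute_stalls + memory_stalls
print(round(cpi, 2))  # 124.08
```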

Problem #2: Superpipelining

Suppose that we have a single-issue, in-order pipeline with one fetch stage, one decode stage, multiple execution stages (which include memory access), and a single write-back stage. Assume that it has the following execution latencies (i.e. the number of stages that it takes to compute a value): multf (5 cycles), addf (3 cycles), divf (2 cycles), integer ops (1 cycle). Assume full bypassing and two cycles to perform memory accesses, i.e. loads and stores take a total of 3 cycles to execute (including address computation). Finally, branch conditions are computed by the first execution stage (integer execution unit).

Problem 2a: Assume that this pipeline consists of a single linear sequence of stages in which later stages serve as no-ops for shorter operations. Draw each stage of the pipeline as a box (no internal details) and name each of the stages. Describe what is computed in each stage and show all of the bypass paths (as arrows between stages). Your goal is to design a pipeline which never stalls unless a value is not ready. Label each of these arrows with the types of instructions that will forward their results along these paths (i.e. use “M” for multf, “D” for divf, “A” for addf, “I” for integer operations). [ Hint: be careful to optimize for information feeding into store instructions!]

Stage   Computed in this stage
F       Fetch next instruction
D       Decode stage
EX1     Integer ops; address compute (for Ld/St); first stage of addf, multf, divf
EX2     First stage of Ld/St; last stage of divf; second stage of addf, multf
EX3     Last stage of Ld/St and addf; third stage of multf
EX4     Fourth stage of multf
EX5     Last stage of multf
W       Writeback stage

Problem 2b: How many extra instructions are required between each of these instruction combinations to avoid stalls (i.e. assume that the second instruction uses a value from the first). Be careful!

Between a divf and a store: 0
Between a multf and an addf: 4
Between a load and a multf: 2
Between an addf and a divf: 2
Between two integer instructions: 0
Between an integer op and a store: 0
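These gaps follow a simple pattern: with full bypassing, the number of independent instructions needed between a producer and a consumer is the producer's completion stage minus the stage at which the consumer first needs the value. Data feeding a store is not needed until the store's final memory stage (EX3), which is the optimization the hint points at. A small sketch, assuming exactly those stage numbers:

```python
# Stage in which each producer's result becomes available (from the table above).
completes = {"multf": 5, "addf": 3, "divf": 2, "int": 1, "load": 3}
# Stage at which a consumer first needs its operand: EX1 for computation and
# address generation, EX3 for store data (assumption stated in the lead-in).
needs = {"multf": 1, "addf": 1, "divf": 1, "int": 1, "store_data": 3}

def gap(producer, consumer):
    """Independent instructions required between producer and consumer."""
    return max(0, completes[producer] - needs[consumer])

print(gap("multf", "addf"), gap("load", "multf"), gap("divf", "store_data"))
# 4 2 0
```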

[Figure: the pipeline F, D, EX1, EX2, EX3, EX4, EX5, W drawn as boxes, with bypass arrows between stages labeled by the instruction types that forward along each path: I; Ld,A; I,D,Ld,A; I,D,Ld,A; I,D,Ld,A,M; M; I,D; D.]

Problem #3: Fixing the loops

For this problem, assume that we have a superpipelined architecture like that in problem (2) with the following use latencies (these are not the right answers for problem #2b!):

Between a multf and an addf: 3 insts
Between a load and a multf/divf: 2 insts
Between an addf and a divf: 1 inst
Between a divf and a store: 7 insts
Between an int op and a store: 0 insts
Number of branch delay slots: 1

Consider the following loop which performs a restricted rotation and projection operation. The array based at register r10 contains pairs of double-precision (64-bit) values which represent x,z coordinates. The array based at register r20 receives a projected coordinate along the observer’s horizontal direction:

loop: ldf   $F20, 0($r10)
      multf $F6, $F20, $F
      addf  $F12, $F6, $F
      ldf   $F10, 8($r10)
      divf  $F13, $F12, $F
      stf   0($r20), $F
      addi  $r10, $r10, #
      addi  $r20, $r20, #
      subi  $r1, $r1, #
      bne   $r1, $zero, loop
      nop

Problem 3a: How many cycles does this loop take per iteration? Indicate stalls in the above code by labeling each of them with a number of cycles of stall:

11 instructions + 14 stalls = 25 cycles/iteration
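The 14 stall cycles can be reproduced by walking the loop with the stated use latencies: a gap of N instructions means the consumer issues at least N+1 cycles after its producer. The dependence of the divf on the second ldf is an assumption here, since the operand fields are truncated in this copy:

```python
# Issue order of the loop body; names are labels for this sketch, not mnemonics.
order = ["ldf1", "multf", "addf", "ldf2", "divf", "stf",
         "addi1", "addi2", "subi", "bne", "nop"]
# (producer, consumer) -> required independent instructions between them,
# from the use-latency table for Problem 3.
gaps = {("ldf1", "multf"): 2, ("multf", "addf"): 3, ("addf", "divf"): 1,
        ("ldf2", "divf"): 2,   # assumed: divf also reads the second load
        ("divf", "stf"): 7}

issue, t = {}, 0
for ins in order:
    t += 1  # at best, one instruction issues per cycle
    for (p, c), g in gaps.items():
        if c == ins:
            t = max(t, issue[p] + g + 1)
    issue[ins] = t

print(issue["nop"], issue["nop"] - len(order))  # 25 total cycles, 14 stalls
```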

Problem 3b: Reschedule this code to run with as few cycles per iteration as possible. Do not unroll it or software pipeline it. How many cycles do you get per iteration of the loop now?

loop: ldf   $F20, 0($r10)
      ldf   $F10, 8($r10)
      multf $F6, $F20, $F
      addf  $F12, $F6, $F
      addi  $r10, $r10, #
      divf  $F13, $F12, $F
      addi  $r20, $r20, #
      subi  $r1, $r1, #
      bne   $r1, $zero, loop
      stf   -8($r20), $F

There are many ways to rearrange/pipeline this code; one is shown above. Note that all of the optimal ones will result in 8 stalls. So:

10 instructions + 8 stalls = 18 cycles/iteration.

Problem 3c: Unroll the loop once and schedule it to run with as few cycles as possible per iteration of the original loop. How many cycles do you get per iteration now?

loop: ldf   $F20, 0($r10)
      ldf   $F10, 8($r10)
      ldf   $F22, 16($r10)
      multf $F6, $F20, $F
      ldf   $F12, 24($r10)
      multf $F7, $F22, $F
      addf  $F12, $F6, $F
      divf  $F13, $F12, $F
      addf  $F16, $F7, $F
      divf  $F14, $F16, $F
      addi  $r10, $r10, #
      addi  $r20, $r20, #
      subi  $r1, $r1, #
      stf   -16($r20), $F
      bne   $r1, $zero, loop
      stf   -8($r20), $F

Total = (16 + 4)/2 = 10 cycles/iteration

Problem 3d: Your loop in (3c) will not run without stalls. Without going to the trouble to unroll further, what is the minimum number of times that you would have to unroll this loop to avoid stalls? Explain. How many cycles would you get per iteration then?

If we have 4 iterations, we can group all the loads together, followed by the multiplies, adds, divides, integer ops, and stores. There will be no stalls except before the stores. Unfortunately, 4 iterations is not quite enough to avoid all stalls, since the first store will stall for one cycle: ..DDDDIIISSSBS has only 6 instructions between the first divf and its store. Thus, we need 5 iterations. Cycles: (6 × 5 + 4)/5 = 6.8 cycles/iteration.

Problem 3e: Software pipeline the original loop to avoid stalls. Overlap 5 different iterations. What is the average number of cycles per iteration? Your code should have no more than one copy of the original instructions. Ignore startup and exit code.

loop: stf   0($r20), $F
      divf  $F13, $F12, $F
      ldf   $F10, 40($r10)
      addf  $F12, $F6, $F
      multf $F6, $F20, $F
      ldf   $F20, 64($r10)
      addi  $r10, $r10, #
      subi  $r1, $r1, #
      bne   $r1, $zero, loop
      addi  $r20, $r20, #

This software pipelining problem had some subtleties with respect to the load placements, which some of you caught but which we didn’t enforce. In particular, note the division into phases shown above. This division is enforced by data flow: phase 5 uses information generated in phase 4, phase 4 by phase 3, etc. Note the careful generation of offsets as well. There are 2 iterations between phase 1 and phase 3. In those two iterations, $r10 will have gained 32. Thus, to correct for this, 64($r10) in phase 1 becomes 32($r10) in phase 3. Since the phase-3 load is 8 bytes further along, the phase-3 offset is (32+8)($r10) = 40($r10).

This runs without stalls. So, the cycles/iteration = 10.

Phase 1 [5 instructions]
Phase 2 [1 instruction]
Phase 3 [2 instructions]
Phase 4 [1 instruction]
Phase 5 [1 instruction]

Problem #4: Short Answers

Problem 4a: Give a simple definition of precise interrupts/exceptions:

A precise interrupt/exception is one which designates a single instruction in the instruction stream for which all preceding instructions have completed and committed their results, and for which the designated instruction and all following instructions have not committed any results (i.e. have not modified machine state).

Problem 4b: Explain how the presence of delayed branches complicates the description of a precise exception point ( Hint: what if there is a divide instruction in a delay slot that gets a divide by zero exception)?

To describe the precise exception point, one needs more than one PC. For instance, if there is a single delay slot, you need two PCs to describe the precise exception point (these are often called the PC and nPC – for next PC). For a machine with n delay slots, you need n+1 PCs.

The reason that you need multiple PCs is to properly restart the pipeline. Consider the case in which the precise exception point is at the delay-slot instruction. In that case, the next instruction to execute on return is clearly the delay-slot instruction. However, the following instruction might be either PC+4 or the target of the branch. Hence the need for the nPC.

Problem 4c: Explain the relationship between support for precise exceptions and support for branch prediction. What hardware structure supports both of these mechanisms in a modern out-of-order pipeline?

With out-of-order execution, precise exceptions require rolling back operations that have already occurred after the exception point. Branch prediction requires the same support (to rollback to the branch). The simplest structure to support rollback is the reorder buffer.

Problem 4d: Explain how pipelining can save power (and energy) for multimedia (streaming) applications:

Multimedia applications consist of large numbers of independent operations. Hence, if one can successfully pipeline the execution units, one can keep the overall clock rate constant and get the “same” throughput (i.e. there are no stalls due to data hazards). After pipelining, there is less logic between registers; as a result, we can lower the voltage without lowering the clock rate. The net result is a power savings without a reduction in throughput.
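The argument can be made concrete with the standard dynamic-power relation P ≈ αCV²f: at the same clock rate, power falls with the square of the supply voltage. A minimal sketch; the capacitance and the 1.2 V / 0.9 V supply values are illustrative assumptions, not numbers from the exam:

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Dynamic switching power: alpha * C * V^2 * f."""
    return activity * c_eff * vdd ** 2 * freq

# Same 500 MHz clock before and after pipelining; shorter logic paths between
# registers let Vdd drop (1.2 V and 0.9 V are made-up illustrative values).
p_before = dynamic_power(1e-9, 1.2, 500e6)
p_after = dynamic_power(1e-9, 0.9, 500e6)
print(p_after / p_before)  # (0.9 / 1.2)^2 = 0.5625
```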

Problem 4e: A PalmPilot is a portable computing device that holds calendars and addresses. It has a micro-power mode that stops the clock and shuts down power to the processor when it is idle. Suppose that it also recognized when the battery was getting low and ran the clock at lower than normal speed during busy periods. Would this extend battery life? Why or why not?

No, it would not extend battery life. The reason is that slowing the clock down increases the amount of time that the PalmPilot must be “on” in order to complete a given task. The way to look at this is that the total number of transitions for a given operation (say, looking up a name in the address book) is constant. Total consumed energy depends on the number of transitions and the voltage, neither of which has changed. If we lowered the voltage as well, the answer would be different (but we didn’t specify this; in fact, PalmPilot versions 1–5 don’t allow voltage variation). Note that laptops change their clock when idle only because the software which runs on them is not particularly intelligent (i.e. it consumes energy during idle loops!).

Figure 1: A basic Tomasulo architecture

Problem 4f: The Tomasulo architecture (shown above) replaces a normal 5-stage pipeline with 4 stages: Fetch, Issue, Execute, and Writeback. One of its strengths is that it is able to Execute instructions in a different order than the programmer originally specified. The simplest version of this architecture also performs Writeback out-of-order as well. However, the Fetch and Issue stages of the Tomasulo architecture are always handled in program order. Why?

We have to Fetch and Issue in order so that we can analyze the dataflow between instructions and maintain the semantics of the program. Alternatively, you could think of the fact that register renaming requires handling each instruction in order so that data will flow properly between instructions which produce values and instructions that use these values.

Problem 4g: Pipelined architectures have three different types of data hazards with respect to registers. Name and define them. For each type, give a short code sequence that illustrates the hazard and describe how a Tomasulo architecture removes this hazard.

RAW: A “Read After Write” hazard occurs when an instruction produces a value which is used by a later instruction. Tomasulo uses the CDB to forward data to correct this type of hazard. Example:

    add $r1, $r2, $r
    sub $r5, $r1, $r

WAR: A “Write After Read” hazard occurs when a later instruction writes a register that needs to be read by an earlier instruction. If data flows from the later to the earlier instruction, we will get incorrect operation. Tomasulo prevents this with register renaming. Example:

    sub $r5, $r1, $r
    add $r1, $r2, $r

WAW: A “Write After Write” hazard occurs when two instructions write the same register. If these instructions write out of order, then the register will have an incorrect value in it. Tomasulo prevents this with register renaming. Example:

    add $r1, $r2, $r
    sub $r1, $r4, $r
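How renaming removes the WAR and WAW cases while preserving RAW dependences can be sketched with a toy renamer; the instruction tuples and physical-register names below are invented for illustration, not part of the exam:

```python
from itertools import count

def rename(instrs):
    """Map architectural destination registers to fresh physical registers.

    Each instruction is (dest, src1, src2); each source reads the latest
    physical name of its architectural register (or the register itself
    if it has not been written yet).
    """
    latest, fresh, out = {}, count(), []
    for dest, s1, s2 in instrs:
        p1 = latest.get(s1, s1)
        p2 = latest.get(s2, s2)
        pd = f"p{next(fresh)}"
        latest[dest] = pd
        out.append((pd, p1, p2))
    return out

# The two writes to r1 (WAW) get distinct physical registers, and the read
# of r1 in between (RAW) still sees the first write's name:
prog = [("r1", "r2", "r3"), ("r5", "r1", "r4"), ("r1", "r6", "r7")]
print(rename(prog))
# -> [('p0', 'r2', 'r3'), ('p1', 'p0', 'r4'), ('p2', 'r6', 'r7')]
```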

D‡rtr

,QW ,QW ,QW

Ay‚h‡vtƒ‚v‡

)ORDW )ORDW

A ‚€ Hr€ AQSrtv†‡r †

8‚€€‚9h‡h7ˆ†897

U‚ Hr€

D†‡ ˆp‡v‚ Rˆrˆr

/RDG /RDG /RDG /RDG /RDG /RDG 6WRUH 6WRUH 6WRUH