CSE 30321 – Computer Architecture I – Fall 2010
Final Exam
December 13, 2010
Test Guidelines:
- Place your name on EACH page of the test in the space provided.
- Answer every question in the space provided. If separate sheets are needed, make sure to include your name and clearly identify the problem being solved.
- Read each question carefully. Ask questions if anything needs to be clarified.
- The exam is open book and open notes.
- All other points of the ND Honor Code are in effect!
- Upon completion, please turn in the test and any scratch paper that you used.
Suggestion:
- Whenever possible, show your work and your thought process. This will make it easier for us to give you partial credit.

Question    Possible Points    Your Points
1           17
2           10
3           17
4           15
5           15
6           17
7           9
Total       100
Problem 1: (17 points)
Question A: (5 points) A cache may be organized such that:
  o In one case, there are more data elements per block and fewer blocks
  o In another case, there are fewer elements per block but more blocks
However, in both cases – i.e. larger blocks but fewer of them OR shorter blocks but more of them – the cache's total capacity (amount of data storage) remains the same. What are the pros and cons of each organization? Support your answer with a short example assuming that the cache is direct mapped. Your answer must fit in the box below.
Solution: If block size is larger:
- Con: There will be fewer blocks and hence a higher potential for conflict misses
- Pro: You may achieve better performance from spatial locality due to the larger block size
- Example: If you have a high degree of sequential data accesses, this makes more sense
If there are fewer elements per block and more blocks:
- Con: You may be more subject to compulsory misses due to the smaller block size
- Pro: You may see fewer conflict misses due to more unique mappings
- Example: If you have more random memory accesses, this makes more sense (see the sketch below)
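As a quick illustration (not from the exam; all sizes here are invented), the sketch below compares miss counts for two equal-capacity direct-mapped caches — 4 blocks of 16 words vs. 16 blocks of 4 words — over a sequential and a random word-address stream:

    # Sketch: equal-capacity direct-mapped caches with different block sizes.
    import random

    def misses(addresses, num_blocks, block_size):
        tags = [None] * num_blocks           # one tag per direct-mapped slot
        count = 0
        for addr in addresses:
            block = addr // block_size       # memory block holding this word
            index = block % num_blocks       # direct-mapped: one possible slot
            tag = block // num_blocks
            if tags[index] != tag:           # cold slot or conflicting tag
                tags[index] = tag
                count += 1
        return count

    sequential = list(range(256))
    random.seed(0)
    scattered = [random.randrange(4096) for _ in range(256)]

    for name, stream in (("sequential", sequential), ("random", scattered)):
        print(name,
              "| 4 blocks x 16 words:", misses(stream, 4, 16),
              "| 16 blocks x 4 words:", misses(stream, 16, 4))

The sequential stream favors the larger blocks (one miss brings in 16 useful words), while the random stream erases that advantage.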
Question B: (5 points) Assume:
- A processor has a direct mapped cache
- Data words are 8 bits long (i.e. 1 byte)
- Data addresses are to the word
- A physical address is 20 bits long
- The tag is 11 bits
- Each block holds 16 bytes of data
How many blocks are in this cache?
Solution: Given that the physical address is 20 bits long and the tag is 11 bits, there are 9 bits left over for the index and offset. We can determine the number of bits of offset as the problem states that:
- Data is word addressable and words are 8 bits long
- Each block holds 16 bytes
As words are 8 bits (1 byte) each, each block holds 16 words; thus 4 bits of offset are needed. This means that there are 5 bits left for the index. Thus, there are 2^5 or 32 blocks in the cache.
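A short sketch (using the parameters given in Question B) of the same arithmetic:

    # Deriving the block count from the address breakdown.
    phys_addr_bits = 20
    tag_bits = 11
    bytes_per_word = 1          # 8-bit words, word-addressable
    block_bytes = 16

    words_per_block = block_bytes // bytes_per_word        # 16 words/block
    offset_bits = words_per_block.bit_length() - 1         # log2(16) = 4 (power of 2)
    index_bits = phys_addr_bits - tag_bits - offset_bits   # 20 - 11 - 4 = 5
    print("blocks in cache:", 2 ** index_bits)             # 2^5 = 32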
Problem 2: (10 points)
Question A: (5 points) The average memory access time for a microprocessor with 1 level of cache is 2.4 clock cycles
- If data is present and valid in the cache, it can be found in 1 clock cycle
- If data is not found in the cache, 80 clock cycles are needed to get it from off-chip memory
Designers are trying to obtain a 65% improvement in average memory access time, and are considering adding a 2nd level of cache on-chip.
- This second level of cache could be accessed in 6 clock cycles
- The addition of this cache does not affect the first level cache’s access patterns or hit times
- Off-chip accesses would still require 80 additional CCs.
To obtain the desired speedup, how often must data be found in the 2nd level cache?
Solution: We must first determine the miss rate of the L1 cache to use in the revised AMAT formula:
AMAT = Hit Time + Miss Rate x Miss Penalty
2.4 = 1 + Miss Rate x 80
Miss Rate = 1.75%
Next, we can calculate the target AMAT, i.e. AMAT with L2:
Speedup = Time (old) / Time (new)
1.65 = 2.4 / Time (new)
Time (new) = 1.4545 clock cycles
We can then use the AMAT formula again to solve for the highest acceptable miss rate in L2:
1.4545 = 1 + 0.0175 x (6 + Miss Rate(L2) x 80)
Solving for Miss Rate(L2) suggests that the highest possible miss rate is ~24.96%.
Thus, as the hit rate is (1 - Miss Rate), the L2 hit rate must be ~75%.
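A quick numeric check of this solution (a sketch using the figures above):

    # Required L2 hit rate for a 65% AMAT improvement.
    amat_old = 2.4
    l1_hit, l1_penalty = 1.0, 80.0
    l2_hit_time, mem_penalty = 6.0, 80.0

    l1_miss_rate = (amat_old - l1_hit) / l1_penalty   # 1.75%
    amat_target = amat_old / 1.65                     # 65% improvement
    # amat_target = l1_hit + l1_miss_rate * (l2_hit_time + l2_miss * mem_penalty)
    l2_miss = ((amat_target - l1_hit) / l1_miss_rate - l2_hit_time) / mem_penalty
    print("L2 hit rate needed: %.1f%%" % (100 * (1 - l2_miss)))   # ~75%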
Question B: (5 points) Assume that the base CPI for a pipelined datapath on a single core system is 1.
- Note that this does NOT include the overhead associated with cache misses!!!
Profiles of a benchmark suite that was run on this single core chip with an L1 cache suggest that for every 10,000,000 accesses to the cache, there are 308,752 L1 cache misses.
- If data is found in the cache, it can be accessed in 1 clock cycle, and there are no pipe stalls
- If data is not found in the cache, it can be accessed in 10 clock cycles Now, consider a multi-core chip system where each core has an equivalent L1 cache:
- All cores reference a common, centralized, shared memory
- Potential conflicts to shared data are resolved by snooping and an MSI coherency protocol
Benchmark profiling obtained by running the same benchmark suite on the multi-core system suggests that, on average, there are now 452,977 misses per 10,000,000 accesses.
- If data is found in a cache, it can still be accessed in 1 clock cycle
- On average, 14 cycles are now required to satisfy an L1 cache miss
What must the CPI of the multi-core system be for it to be worthwhile to abandon the single core approach?
Solution: We first need to calculate the CPI of the single core system and incorporate the overhead of cache misses:
CPI = 1 + (308,752 / 10,000,000)(10) = 1.309
We need to better this number with the multi-core approach. Thus:
1.309 > X + (452,977 / 10,000,000)(14)
1.309 > X + 0.634
0.675 > X
Thus, the CPI must be less than 0.675; this is reasonable, as execution will proceed in parallel.
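A small sketch (numbers from Question B) of the break-even CPI:

    # Break-even base CPI for the multi-core system.
    single_cpi = 1 + (308_752 / 10_000_000) * 10        # ~1.309 with L1 misses
    multi_miss_overhead = (452_977 / 10_000_000) * 14   # ~0.634 cycles/instr
    base_cpi_needed = single_cpi - multi_miss_overhead
    print("multi-core base CPI must be below %.3f" % base_cpi_needed)  # ~0.675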
Problem 3: (17 points)
Question A: (6 points)
Solution: The final TLB state is as shown below:
- The entry with tag 0011 has been evicted.
- Other LRU bits have been updated accordingly
- The fact that the last entry involves a write will cause the dirty bit to be set to 1.

Valid  Dirty  LRU  Tag   Physical Page #
1      0      4    0110  1000
1      1      1    1011  0000
1      0      3    1000  0110
1      0      2    0100  1010

Question B: (8 points)
- Assume that you have 4 Gbytes of main memory at your disposal
- 1 Gbyte of the 4 Gbytes has been reserved for process page table storage
- Each page table entry consists of:
  o A physical frame number
  o 1 valid bit
  o 1 dirty bit
  o 1 LRU status bit
- Virtual addresses are 32 bits
- Physical addresses are 26 bits
- The page size is 8 Kbytes
How many process page tables can fit in the 1 Gbyte space?
Solution: We can first calculate the number of page table entries associated with each process.
- 8 KB pages implies that the offset associated with a VA/PA is 13 bits (2^13 = 8KB)
- Thus, the remaining 19 bits are used to represent VPNs and serve as indices to the PT
- Thus, each PT has 2^19 entries.
Next, each PA is 26 bits:
- 13 bits of the PA come from the offset, the other 13 come from a PT lookup
- Given that each PT entry consists of a PFN, a valid bit, a dirty bit, and a LRU bit, each PT entry will be 2 bytes
Therefore:
# of page tables = 2^30 bytes available x (1 PT entry / 2 bytes) x (1 process / 2^19 PT entries)
# of page tables = 2^10 or 1024.
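A minimal sketch (using the parameters above) checking this arithmetic:

    # Page tables that fit in the reserved 1 GB.
    page_offset_bits = 13                   # 8 KB pages (2^13 bytes)
    vpn_bits = 32 - page_offset_bits        # 19-bit virtual page numbers
    pfn_bits = 26 - page_offset_bits        # 13-bit physical frame numbers
    entry_bytes = (pfn_bits + 3 + 7) // 8   # PFN + valid/dirty/LRU bits -> 2 bytes
    table_bytes = (2 ** vpn_bits) * entry_bytes
    print("page tables in 1 GB:", 2 ** 30 // table_bytes)   # 1024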
Question C: (3 points) If a virtual-to-physical address translation takes ~100 clock cycles, what is the most likely reason for this particular latency? Your answer must fit in the box below.
Solution:
- There was a TLB miss requiring a main memory access to read the page table
- The page table entry was valid and no page fault was incurred.
Problem 4: (15 points)
This question considers the basic, MIPS, 5-stage pipeline (F, D, EX, M, WB). (For this problem, you may assume that there is full forwarding for all questions.)
Question A: (5 points) Assume that you have the following sequence of pipelined instructions:
lw $6, 0($7)
add $8, $9, $10
sub $11, $6, $8
Where will the data operands that are processed during the EX stage of the subtract (sub) instruction come from? Also, you might take into account the following suggestions:
- Draw a simple diagram to help support your answer.
- If you are using more than a few sentences, or drawing anything but a simple picture to answer this question, you are making it too hard. You don't need to draw any control signals, etc.
Solution:
[Pipeline diagram: IF stage | IF/ID latch | ID stage | ID/EX latch | EX stage | EX/Mem latch | M stage | M/WB latch | WB stage]
- The data associated with $6 in the subtract instruction will come from the M/WB latch
  o Could also say register file, assuming you can read/write in the same CC
- The data associated with $8 in the subtract instruction will come from the EX/M latch
- Muxes are needed to pick between the original input from the register file and forwarded data (see the sketch below)
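As an illustrative aside (not part of the exam solution), a minimal sketch of that mux-selection logic; the function and its arguments are invented for the example:

    # Hypothetical sketch of forwarding-source selection for one EX operand.
    def forward_source(src_reg, ex_m_dest, m_wb_dest):
        # Prefer the youngest in-flight result: EX/M latch first, then M/WB
        # latch; otherwise fall back to the value read from the register file.
        if src_reg == ex_m_dest:
            return "EX/M latch"
        if src_reg == m_wb_dest:
            return "M/WB latch"
        return "register file"

    # sub $11, $6, $8 in EX, with add (dest $8) in M and lw (dest $6) in WB:
    print(forward_source(6, ex_m_dest=8, m_wb_dest=6))  # M/WB latch
    print(forward_source(8, ex_m_dest=8, m_wb_dest=6))  # EX/M latch

Question B: (6 points) Show how the instructions in the sequence given below will proceed through the pipeline: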
- We will predict that the beq instruction is not taken
- When the beq instruction is executed, the value in $1 is equal to the value in $2
Solution:
                     1   2   3   4   5   6   7   8   9   10  11  12
beq $1, $2, X        F   D   E
lw $10, 0($11)           F   D   (squashed: branch taken)
sub $14, $10, $10            F   (squashed: branch taken)
X: add $4, $1, $2                F   D   E   M   W
lw $1, 0($4)                         F   D   E   M   W
sub $1, $1, $1                           F   D   D   E   M   W
add $1, $1, $1                               F   F   D   E   M   W

Question C: (4 points) For the instruction mix above, on what instruction results does the last add instruction depend?
Solution: Just the sub instruction before it; the sub produces the only data consumed by the add instruction
Problem 5: (15 points)
Question C: (4 points) Assuming that N instructions are executed, and all N instructions are add instructions, what is the speedup of a pipelined implementation when compared to a multi-cycle implementation? Your answer should be an expression that is a function of N.
Solution: For the multi-cycle approach:
- Each add instruction would take 4 clock cycles and each clock cycle would take 305 ps.
- Thus, the total time would be: 1220(N) ps
For the pipelined approach:
- For N instructions, we can apply the formula: NT + (S-1)T, where T is the cycle time and S is the number of pipeline stages
- Thus, the total time would be:
  o = 305(N) + (5-1)(305)
  o = 305N + 1220 ps
Thus, the overall speedup is: 1220(N) / [305(N) + 1220]
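A small numeric check of this expression (a sketch, assuming the 305 ps cycle time given above):

    # Pipelined vs. multi-cycle speedup as a function of N.
    def speedup(n):
        multicycle = 4 * 305 * n             # 4 CCs per add, 305 ps per CC
        pipelined = 305 * n + (5 - 1) * 305  # N*T + (S-1)*T fill time
        return multicycle / pipelined

    for n in (1, 10, 1000):
        print(n, round(speedup(n), 2))       # approaches 4 as N grows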
Question D: (4 points) This question should be answered independently of Questions A-C. Assume you break up the memory stage into 2 stages instead of 1 to improve throughput in a pipelined datapath.
- Thus, the pipeline stages are now: F, D, EX, M1, M2, WB
Show how the instructions below would progress through this 6-stage pipeline. Like before, full forwarding hardware is available.
Solution:
                     1   2   3   4   5   6   7   8   9   10
lw $5, 0($4)         F   D   EX  M1  M2  WB
add $7, $5, $5           F   D   D   D   EX  M1  M2  WB
sub $8, $5, $9               F   F   F   D   EX  M1  M2  WB
Problem 6: (17 points)
Question A: (4 points) In your opinion, what are the 2 most significant impediments to obtaining speedup from N cores on a multi-core chip (irrespective of finding a good parallel algorithm)? You should also add a 1-2 sentence justification for each item. Your answer must fit in the box below.
Solution: A good answer should mention 2 of these 3 things:
- Cache coherency overhead
- Contention for shared resources (i.e. the interconnection network or a higher-level shared cache)
- Latency
Question B: (9 points) As was discussed in lecture, as more and more cores are placed on a chip, it can make sense to connect them with some sort of interconnection network to support core-to-core communication. Assume that you have a 6-core chip that you want to program to solve a given problem:
- You can use as few as 1 core or as many as 6 cores to solve the problem
- The problem requires 450,000 iterations of a main loop to complete
- Each loop iteration requires 100 clock cycles
- Any startup overhead can be ignored
  o (i.e. instructions outside of the loop, to instantiate a new instance of the problem on another core, etc.)
If more than 1 core is used to solve a problem, communication overhead must be added to the total execution time.
- Communication overhead is a function of the number of cores used to solve the problem, and is specified in the table below:

Number of cores used    Communication overhead per iteration
1                       0 cycles
2                       10 cycles
3                       20 cycles
4                       30 cycles
5                       40 cycles
6                       50 cycles
- Thus, for example, the communication overhead if 2 cores are used is:
  o 10 cycles / iteration x 450,000 iterations = 4,500,000 cycles
How many cores should be used to solve this problem?
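A minimal sketch of the trade-off, assuming compute cycles divide evenly across the cores while the per-iteration communication overhead is paid on all 450,000 iterations (matching the 2-core example above):

    # Total cycles as a function of core count.
    iterations, cycles_per_iter = 450_000, 100
    overhead = {1: 0, 2: 10, 3: 20, 4: 30, 5: 40, 6: 50}   # cycles/iteration

    totals = {n: (iterations // n) * cycles_per_iter + iterations * overhead[n]
              for n in overhead}
    for n, t in totals.items():
        print(n, "cores:", t, "cycles")
    print("best:", min(totals, key=totals.get), "cores")

Under this model, 3 cores minimize total cycles (24,000,000, vs. 27,000,000 for 2 cores and 24,750,000 for 4 cores).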
Problem 7: (9 points)
A snapshot of the state associated with 2 caches, on 2 separate cores, in a centralized shared memory system is shown below. In this system, cache coherency is maintained with an MSI snooping protocol. You can assume that the caches are direct mapped.

P0:
         Tag   Data Word 1  Data Word 2  Data Word 3  Data Word 4  Coherency State
Block 0  1000  10           20           30           40           M
Block 1  4000  500          600          700          800          S
...
Block N  3000  2            4            6            8            S

P1:
         Tag   Data Word 1  Data Word 2  Data Word 3  Data Word 4  Coherency State
Block 0  1000  10           10           10           10           I
Block 1  8000  500          600          700          800          S
...
Block N  3000  2            4            6            8            S

Question A: (3 points) If P0 wants to write Block 0, what happens to its coherency state?
Solution: Nothing. It has the only modified copy.
Question B: (3 points) If P1 writes to Block 1, is Block 1 on P0 invalidated? Why or why not?
Solution: No. The tags are different, so this is different data.
Question C: (3 points) If P1 brings in Block M for reading, and no other cache has a copy, what state is it cached in?
Solution: It would still be cached in the shared state, because this is not the MESI protocol.
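As a closing illustration (not part of the exam), a minimal MSI transition table consistent with the three answers above; the event names are invented for this sketch:

    # Minimal MSI sketch: (state, event) -> next state.
    # States: M (Modified), S (Shared), I (Invalid).
    MSI = {
        ("M", "processor_write"): "M",   # Question A: only modified copy, no change
        ("M", "processor_read"):  "M",
        ("S", "processor_read"):  "S",
        ("S", "processor_write"): "M",   # invalidates other sharers via the bus
        ("I", "processor_read"):  "S",   # Question C: reads always fill in S in MSI
        ("I", "processor_write"): "M",
        ("M", "bus_write_miss"):  "I",   # another core writes the same block
        ("S", "bus_write_miss"):  "I",   # Question B: only if the tags match
    }
    print(MSI[("M", "processor_write")])  # M -> state unchanged (Question A)
    print(MSI[("I", "processor_read")])   # S -> no Exclusive state in MSI (Question C)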