






































Presentation slides for the basics of pipelining
Typology: Slides
CMPUT 429 - Computer Systems
José Nelson Amaral (Adapted from David A. Patterson’s CS lecture slides at Berkeley)
Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time.
A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment.
The throughput of an instruction pipeline measures how often an instruction exits the pipeline.
Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

[Timeline figure: 6 PM to Midnight, loads ordered vertically over time]

Pipelined laundry takes 3.5 hours for 4 loads.

[Timeline figure: 6 PM to Midnight; segment lengths 30, 40, 40, 40, 40, 20 minutes]

What is preventing them from doing it faster? How could we eliminate this limiting factor?
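The 6-hour and 3.5-hour figures can be reproduced with a small sketch. The stage times (wash 30 min, dry 40 min, fold 20 min) are read off the slide's timeline; the simulation assumes a stage starts as soon as its machine is free and the load has finished the previous stage.

```python
# Laundry stage times in minutes (wash, dry, fold), taken from the
# slide's timeline; the 40-minute dryer is the bottleneck.
STAGES = [30, 40, 20]

def finish_time(n_loads, stages=STAGES, pipelined=True):
    """Minute at which the last load is completely done."""
    if not pipelined:
        # One load runs to completion before the next starts.
        return n_loads * sum(stages)
    free = [0] * len(stages)      # when each machine becomes free
    done = 0
    for _ in range(n_loads):
        t = 0                     # finish time of this load's previous stage
        for s, minutes in enumerate(stages):
            start = max(t, free[s])
            t = start + minutes
            free[s] = t
        done = t
    return done

print(finish_time(4, pipelined=False))  # 360 minutes = 6 hours
print(finish_time(4))                   # 210 minutes = 3.5 hours
```

The pipelined time is dominated by the slowest stage: after the first wash, a load leaves the dryer every 40 minutes, which is exactly the limiting factor the slide asks about.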
[Diagram: MIPS multicycle datapath — PC, memory (read/write address, MemData, write data), instruction register (fields I[31-26], I[25-0], I[25-21], I[20-16], I[15-11], I[15-0]), register file (read registers 1 and 2, write register, write data, read data 1/2), ALU with Zero output and Target register, sign extension, shift-left-2 and concatenate/shift-left-2 units, and the multiplexers connecting them]
Figure 3.1, Page 130, CA:AQA 2e.

[Diagram: the five-stage pipelined datapath — Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, and Write Back — with Next PC / Next SEQ PC logic, instruction memory, register file (RS fields, RD), sign extension of the immediate (Imm), adder, ALU with Zero? test, data memory (LMD), WB data path, and the multiplexers between stages]
Step-by-step actions of the multicycle datapath, by instruction type (R-type, load, store, branch, jump):

Fetch (all types):
  IR <- Memory[PC]
  PC <- PC + 4
Decode (all types):
  A <- Registers[IR[25-21]]
  B <- Registers[IR[20-16]]
  Target <- PC + (sign-extend(IR[15-0]) << 2)
Execute:
  R-type:     ALUout <- A op B
  load/store: ALUout <- A + sign-extend(IR[15-0])
  branch:     if (A == B) then PC <- Target
  jump:       PC <- concat(PC[31-28], IR[25-0] << 2)
Memory:
  R-type: Reg[IR[15-11]] <- ALUout
  load:   memdata <- Mem[ALUout]
  store:  Mem[ALUout] <- B
Write-back:
  load: Reg[IR[20-16]] <- memdata
We can divide the execution of an instruction into the following stages:
IF: Instruction Fetch
ID: Instruction Decode
EX: Execution
MEM: Memory Access
WB: Write Back
Stage latencies: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns.

Pipeline throughput: how often an instruction is completed.

T = 1 / max(lat(IF), lat(ID), lat(EX), lat(MEM), lat(WB))
  = 1 / max(5, 4, 5, 10, 4) ns
  = 1/10 instr/ns
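As a quick check, the throughput arithmetic can be written out (stage latencies in ns are the ones given on the slide):

```python
# Stage latencies in ns, as given on the slide.
lats = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}

# The clock period must cover the slowest stage, so the pipeline
# completes at most one instruction per max-stage-latency.
clock = max(lats.values())   # 10 ns, limited by the MEM stage
throughput = 1 / clock       # 0.1 instructions per ns

print(clock, throughput)
```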
Pipeline latency: how long it takes to execute an instruction in the pipeline.

L = lat(IF) + lat(ID) + lat(EX) + lat(MEM) + lat(WB)
  = 5 ns + 4 ns + 5 ns + 10 ns + 4 ns
  = 28 ns
Is this right?
Simply adding the latencies to compute the pipeline latency only works for an isolated instruction:

I1: IF ID EX MEM WB     L(I1) = 28 ns
I2:  IF ID EX MEM WB    L(I2) = 33 ns
I3:   IF ID EX MEM WB   L(I3) = 38 ns
I4:    IF ID EX MEM WB  L(I4) = 43 ns

We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
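The growing latencies can be reproduced with a small sketch. It assumes unlimited buffering between stages, so a stage starts as soon as it is free and the instruction has finished the previous stage; instructions queue up behind the slow 10 ns MEM stage.

```python
# Stage latencies in ns (IF, ID, EX, MEM, WB), as given on the slide.
LATS = [5, 4, 5, 10, 4]

def instruction_latencies(n, lats=LATS):
    """Latency of each of n back-to-back instructions in the pipeline."""
    free = [0] * len(lats)       # time at which each stage becomes free
    out = []
    for _ in range(n):
        t = 0                    # finish time of the previous stage
        issue = 0
        for s, lat in enumerate(lats):
            start = max(t, free[s])
            if s == 0:
                issue = start    # instruction enters the pipeline at IF
            t = start + lat
            free[s] = t
        out.append(t - issue)    # latency = completion time - entry time
    return out

print(instruction_latencies(4))  # [28, 33, 38, 43]
```

Each later instruction waits 5 ns longer for MEM than the one before it, which is exactly the non-constant latency the slide points out.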
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.)

How long would it take using the same modules without pipelining?

ExecTime_pipe = 20000 x 10 ns = 200000 ns = 200 us
ExecTime_non-pipe = 20000 x 28 ns = 560000 ns = 560 us
Thus the speedup that we got from the pipeline is:

Speedup = ExecTime_non-pipe / ExecTime_pipe = 560 us / 200 us = 2.8

How can we improve this pipeline design? We need to reduce the imbalance to increase the clock speed.
New stage latencies: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns (the 10 ns MEM stage is split into two 5 ns stages).

I1: IF ID EX MEM1 MEM2 WB
I2:  IF ID EX MEM1 MEM2 WB
I3:   IF ID EX MEM1 MEM2 WB
I4:    IF ID EX MEM1 MEM2 WB
I5:     IF ID EX MEM1 MEM2 WB
I6:      IF ID EX MEM1 MEM2 WB
I7:       IF ID EX MEM1 MEM2 WB
How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)

ExecTime_pipe = 20000 x 5 ns = 100000 ns = 100 us

Thus the speedup that we get from the pipeline is:

Speedup = ExecTime_non-pipe / ExecTime_pipe = 560 us / 100 us = 5.6
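The execution-time and speedup arithmetic for both pipeline designs can be sketched in a few lines (fill/drain and stall cycles are ignored, as on the slides):

```python
# Execution times in ns for 20000 instructions, using the slide's numbers.
N = 20_000
t_nonpipe    = N * 28  # 28 ns per instruction, no overlap
t_unbalanced = N * 10  # clock limited by the 10 ns MEM stage
t_balanced   = N * 5   # MEM split into two 5 ns stages; clock = 5 ns

print(t_nonpipe / t_unbalanced)  # 2.8: speedup of the unbalanced pipeline
print(t_nonpipe / t_balanced)    # 5.6: speedup after splitting MEM
```

Balancing the stages doubles the clock rate, so the speedup over the non-pipelined design doubles from 2.8 to 5.6.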