Pipelining Basics- Lecture Slides, Slides of Computer Architecture and Organization

Presentation slides covering the basics of pipelining.

Typology: Slides
Academic year: 2016/2017
Uploaded on 11/08/2017 by shashank95

CMPUT 429 - Computer Systems and Architecture
CMPUT 429 / CMPE 382, Winter 2001
Topic 3: Pipelining
José Nelson Amaral
(Adapted from David A. Patterson's CS252 lecture slides at Berkeley)


What is Pipelining?

Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time.

A pipeline within a processor is similar to a car assembly line. Each assembly station is called a pipe stage or a pipe segment.

The throughput of an instruction pipeline measures how often an instruction exits the pipeline.

Sequential Laundry

- Sequential laundry takes 6 hours for 4 loads.
- If they learned pipelining, how long would laundry take?

[Figure: sequential laundry timeline - tasks A-D in task order plotted against time, 6 PM to midnight]

Pipelined Laundry

- Start work ASAP.
- Pipelined laundry takes 3.5 hours for 4 loads.

[Figure: pipelined laundry timeline - tasks A-D overlapped in task order against time, 6 PM to midnight; time-axis intervals 30, 40, 40, 40, 40, 20 minutes]

What is preventing them from doing it faster?

How could we eliminate this limiting factor?
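The laundry numbers above can be checked with a short sketch. The stage times (30-minute wash, 40-minute dry, 20-minute fold) are read off the slide's time axis; the pipelined case is limited by the slowest stage, the dryer:

```python
# Sequential vs. pipelined laundry, assuming per-load stage times
# of 30 min (wash), 40 min (dry), 20 min (fold) from the slide.
def sequential_time(stages, loads):
    """Each load runs all stages to completion before the next starts."""
    return loads * sum(stages)

def pipelined_time(stages, loads):
    """Loads overlap; each new load waits only on the slowest stage."""
    bottleneck = max(stages)
    return sum(stages) + (loads - 1) * bottleneck

stages = [30, 40, 20]                  # minutes per stage
print(sequential_time(stages, 4))      # 360 min = 6 hours
print(pipelined_time(stages, 4))       # 90 + 3*40 = 210 min = 3.5 hours
```

The bottleneck stage (the 40-minute dryer) is exactly what "prevents them from doing it faster": total time grows by 40 minutes per extra load, no matter how fast the other stages are.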

[Figure: multicycle MIPS datapath - PC, unified instruction/data memory, instruction register, register file, sign extension, shift-left-2, ALU with zero output, Target register, and multiplexers selecting among instruction fields I[25-21], I[20-16], I[15-0], and [15-11]]

5 Steps of MIPS Datapath

Figure 3.1, Page 130, CA:AQA 2e

Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back

[Figure: 5-stage MIPS datapath - Next PC logic, instruction memory, register file (RS, RS, RD, Imm fields), sign extend, adder, ALU with zero detect, data memory, LMD, WB data, and the stage multiplexers]

Steps to Execute Each Instruction Type

Fetch (all types):
  IR ← Memory[PC]
  PC ← PC + 4

Decode (all types):
  A ← Registers[IR[25-21]]
  B ← Registers[IR[20-16]]
  Target ← PC + (sign-extend(IR[15-0]) << 2)

Execute:
  R-type:      ALUout ← A op B
  load/store:  ALUout ← A + sign-extend(IR[15-0])
  branch:      if (A == B) then PC ← Target
  jump:        PC ← concat(PC[31-28], IR[25-0] << 2)

Memory:
  R-type:      Reg(IR[15-11]) ← ALUout
  load:        Memdata ← Mem[ALUout]
  store:       Mem[ALUout] ← B

Write-back:
  load:        Reg(IR[20-16]) ← Memdata
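As a concrete illustration, the register transfers for a load can be traced in a few lines of Python. The toy memory (a dict), register file, and the specific instruction are illustrative assumptions, not from the slides:

```python
# Sketch: walk a load instruction through the Decode/Execute/Memory/
# Write-back transfers listed above, using a toy memory and register file.
def sign_extend16(x):
    """Sign-extend a 16-bit immediate to a Python int."""
    return x - (1 << 16) if x & 0x8000 else x

def execute_load(ir, pc, regs, mem):
    """Fetch has already produced IR; apply the remaining transfers."""
    pc = pc + 4                              # PC <- PC + 4
    a = regs[(ir >> 21) & 0x1F]              # A <- Registers[IR[25-21]]
    aluout = a + sign_extend16(ir & 0xFFFF)  # ALUout <- A + sign-extend(IR[15-0])
    memdata = mem[aluout]                    # Memdata <- Mem[ALUout]
    regs[(ir >> 16) & 0x1F] = memdata        # Reg(IR[20-16]) <- Memdata
    return pc

regs = [0] * 32
regs[8] = 0x100                                  # base register $8
mem = {0x104: 42}                                # value stored at base + 4
ir = (0x23 << 26) | (8 << 21) | (9 << 16) | 4    # lw $9, 4($8)
pc = execute_load(ir, 0, regs, mem)
print(regs[9], pc)                               # 42 4
```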

Pipeline Stages

We can divide the execution of an instruction into the following stages:

IF: Instruction Fetch
ID: Instruction Decode
EX: Execution
MEM: Memory Access
WB: Write Back
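The overlap of these five stages can be sketched as a cycle-by-cycle occupancy table, assuming an ideal pipeline that issues one instruction per cycle with no stalls:

```python
# Print which stage each instruction occupies per cycle in an ideal
# 5-stage pipeline (one instruction issued per cycle, no stalls).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr, cycle):
    """Stage occupied by instruction `instr` (issued at cycle `instr`)."""
    k = cycle - instr
    return STAGES[k] if 0 <= k < len(STAGES) else "  "

for i in range(4):                       # instructions I0..I3
    row = [stage_of(i, c) for c in range(8)]
    print(f"I{i}: " + " ".join(row))
```

Each row is shifted one cycle right of the previous one; that diagonal is the pipeline overlap the laundry analogy illustrates.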

Pipeline Throughput and Latency

IF    ID    EX    MEM    WB
5 ns  4 ns  5 ns  10 ns  4 ns

Pipeline throughput: how often an instruction is completed.

T = 1 / max(lat(IF), lat(ID), lat(EX), lat(MEM), lat(WB))
  = 1 / max(5 ns, 4 ns, 5 ns, 10 ns, 4 ns)
  = 1 / 10 ns
  = 0.1 instr/ns

Pipeline latency: how long it takes to execute an instruction in the pipeline.

L = lat(IF) + lat(ID) + lat(EX) + lat(MEM) + lat(WB)
  = 5 ns + 4 ns + 5 ns + 10 ns + 4 ns
  = 28 ns

Is this right?
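The two formulas above can be evaluated directly; the stage latencies are the ones given on the slide:

```python
# Throughput and (isolated-instruction) latency for the slide's stage
# latencies, assuming the pipeline advances at the pace of the slowest stage.
lat = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}   # ns

throughput = 1 / max(lat.values())   # instructions per ns
latency = sum(lat.values())          # ns, valid for an isolated instruction

print(throughput)   # 0.1 -> one instruction every 10 ns
print(latency)      # 28
```

As the next slide shows, summing the stage latencies is only valid for an isolated instruction, which is why the slide asks "Is this right?".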

Pipeline Throughput and Latency

IF    ID    EX    MEM    WB
5 ns  4 ns  5 ns  10 ns  4 ns

Simply adding the latencies to compute the pipeline latency would only work for an isolated instruction:

I1: IF ID EX MEM WB             L(I1) = 28 ns
I2:    IF ID EX MEM WB          L(I2) = 33 ns
I3:       IF ID EX MEM WB       L(I3) = 38 ns
I4:          IF ID EX MEM WB    L(I4) = 43 ns

We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.

Pipeline Throughput and Latency

IF    ID    EX    MEM    WB
5 ns  4 ns  5 ns  10 ns  4 ns

How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.)

How long would it take using the same modules without pipelining?

ExecTime_pipe = 20000 × 10 ns = 200000 ns = 200 µs

ExecTime_nonpipe = 20000 × 28 ns = 560000 ns = 560 µs

Pipeline Throughput and Latency

IF    ID    EX    MEM    WB
5 ns  4 ns  5 ns  10 ns  4 ns

Thus the speedup that we got from the pipeline is:

Speedup_pipe = ExecTime_nonpipe / ExecTime_pipe = 560 µs / 200 µs = 2.8

How can we improve this pipeline design?

We need to reduce the unbalance to increase the clock speed.
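The execution times and speedup above reduce to a few lines of arithmetic, using the slide's values (10 ns clock set by MEM, 28 ns per unpipelined instruction):

```python
# Execution time for N instructions, ignoring bubbles from branches,
# cache misses, and hazards (values from the slide).
n = 20000
clock = 10                 # ns: pipeline clock, set by the slowest stage (MEM)
instr_latency = 28         # ns: one full instruction without pipelining

t_pipe = n * clock         # one instruction completes per clock
t_nonpipe = n * instr_latency
speedup = t_nonpipe / t_pipe

print(t_pipe, t_nonpipe, speedup)   # 200000 560000 2.8
```

Note the speedup is 28/10 = 2.8, not 5: the unbalanced 10 ns MEM stage, not the number of stages, sets the clock.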

Pipeline Throughput and Latency

IF    ID    EX    MEM1   MEM2   WB
5 ns  4 ns  5 ns  5 ns   5 ns   4 ns

The 10 ns MEM stage is split into two 5 ns stages, MEM1 and MEM2:

I1: IF ID EX MEM1 MEM2 WB
I2:    IF ID EX MEM1 MEM2 WB
I3:       IF ID EX MEM1 MEM2 WB
I4:          IF ID EX MEM1 MEM2 WB
I5:             IF ID EX MEM1 MEM2 WB
I6:                IF ID EX MEM1 MEM2 WB
I7:                   IF ID EX MEM1 MEM2 WB

Pipeline Throughput and Latency

IF    ID    EX    MEM1   MEM2   WB
5 ns  4 ns  5 ns  5 ns   5 ns   4 ns

How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.)

ExecTime_pipe = 20000 × 5 ns = 100000 ns = 100 µs

Thus the speedup that we get from the pipeline is:

Speedup_pipe = ExecTime_nonpipe / ExecTime_pipe = 560 µs / 100 µs = 5.6
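Repeating the earlier calculation with the split MEM stage shows the payoff of balancing the pipeline; the 5 ns clock and 28 ns unpipelined latency are the slide's values:

```python
# Same workload as before, but with MEM split into MEM1/MEM2 so the
# clock drops from 10 ns to 5 ns (values from the slide).
n = 20000
clock = 5                  # ns: new clock, set by the slowest remaining stage
instr_latency = 28         # ns: unpipelined instruction time is unchanged

t_pipe = n * clock
speedup = (n * instr_latency) / t_pipe

print(t_pipe, speedup)     # 100000 5.6
```

Halving the bottleneck stage doubles the throughput (speedup 2.8 to 5.6), even though each individual instruction now passes through six stages instead of five.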