Frontiers of Supercomputing
The Efficiency of Parallel Processing
Parallel processing, or the application of several processors to a single task, is an old idea with a relatively large literature. The advent of very large-scale integration technology has made testing the idea feasible, and the fact that single-processor systems are approaching their maximum performance level has made it necessary. We shall show, however, that successful use of parallel processing imposes stringent performance requirements on algorithms, software, and architecture.
The so-called asynchronous systems that use a few tightly coupled high-speed processors are a natural evolution from high-speed single-processor systems. Indeed, systems with two to four processors will soon be available (for example, the Cray X-MP, the Cray-2, and the Control Data System 2XX). Systems with eight to sixteen processors are likely by the early 1990s. What are the prospects of using the parallelism in such systems to achieve high speed in the execution of a single application? Early attempts with vector processing have shown that plunging forward without a precise understanding of the factors involved can lead to disastrous results. Such understanding will be even more critical for systems now contemplated that may use up to a thousand processors.
The key issue in the parallel processing of a single application is the speedup achieved, especially its dependence on the number of processors used. We define speedup (S) as the factor by which the execution time for the application changes; that is,

S = (execution time for one processor) / (execution time for p processors).
To estimate the speedup of a tightly coupled system on a single application, we use a model of parallel computation introduced by Ware. We define a as the fraction of work in the application that can be processed in parallel. Then we make a simplifying assumption of a two-state machine; that is, at any instant either all p processors are operating or only one processor is operating. If we normalize the execution time for one processor to unity, then

S(p, a) = 1 / [(1 - a) + a/p].

LOS ALAMOS SCIENCE Fall 1983

Note that the first term in the denominator is the execution time devoted to that part of the application that cannot be processed in parallel, and the second term is the time for that part that can be processed in parallel.
How does speedup vary with a? In particular, what is this relationship for a = 1, the ideal limit of complete parallelization? Differentiating S, we find that

dS/da = p(p - 1) / [p(1 - a) + a]^2,

which at a = 1 equals p(p - 1).
The accompanying figure shows the Ware model of speedup as a function of a for a 4-processor, an 8-processor, and a 16-processor system. The quadratic dependence of the derivative on p results in low speedup for a less than 0.9. Consequently, to achieve significant speedup, we must have highly parallel algorithms. It is by no means evident that algorithms in current use on single-processor machines contain the requisite parallelism, and research will be required to find suitable replacements for those that do not. Further, the highly parallel algorithms available must be implemented with care. For example, it is not sufficient to look at just those portions of the application amenable to parallelism because a is determined by the entire application. For a close to 1, changes in those few portions less amenable to parallelism will cause small changes in a, but the quadratic behavior of the derivative will translate those small changes in a into large changes in speedup.
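This sensitivity near a = 1 is easy to check numerically. The short Python sketch below (ours, not the article's) evaluates the Ware formula for a 16-processor system:

```python
def ware_speedup(p, a):
    # Ware model: serial fraction (1 - a) plus parallel fraction a spread over p
    return 1.0 / ((1.0 - a) + a / p)

# A drop of a few percent in a costs far more than a few percent of speedup:
for a in (1.0, 0.99, 0.95, 0.90):
    print(f"a = {a:.2f}  S(16, a) = {ware_speedup(16, a):.2f}")
```

Going from a = 1.00 to a = 0.95, a 5% change in a, cuts the speedup from 16 to about 9.1, a loss of over 40%.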
Those who have experience with vector processors will note a striking similarity between the Ware curves and plots of vector-processor performance versus the fraction of vectorizable computation. This similarity is due to the assumption in the Ware model of a two-state machine, since a vector processor can also be viewed in that manner. In one state it is a relatively slow, general-purpose machine, and in the other state it is capable of high performance on vector operations.
Ware's model is inadequate in that it assumes that the instruction stream executed on a parallel system is the same as that executed on a single processor. Seldom is this the case because multiple-processor systems usually require execution of instructions dealing with synchronization of the processes and communication between processors. Further, parallel algorithms may inherently require additional instructions. To correct for this inadequacy, we add a term, σ(p), representing the overhead of parallel implementation; this term is at best nonnegative and usually monotonically increasing with p, and it depends on the algorithm, the architecture, and even the application. Adding it to the denominator gives the modified model:

S(p, a) = 1 / [(1 - a) + a/p + σ(p)].

If the application can be put completely in parallel form, then

S(p, 1) = 1 / [1/p + σ(p)] < p.

In other words, the maximum speedup of a real system is less than the number of processors p, and it may be significantly less. Also note that, whatever the value of a, S will have a maximum for sufficiently large p if σ(p) continues to increase.

[Figure: Speedup as a function of parallelism (a) and number of processors.]
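The effect of a growing overhead term can be sketched numerically. In the Python example below, the overhead function and its coefficient are purely illustrative assumptions, not values from the article:

```python
def modified_speedup(p, a, sigma):
    """Ware model with an overhead term sigma(p) added to the denominator."""
    return 1.0 / ((1.0 - a) + a / p + sigma(p))

# Hypothetical overhead growing linearly with p (illustrative only):
sigma = lambda p: 0.001 * p

# Even with a = 1, speedup peaks at a finite p and then declines:
best_p = max(range(1, 201), key=lambda p: modified_speedup(p, 1.0, sigma))
print(best_p, round(modified_speedup(best_p, 1.0, sigma), 2))
```

With this overhead, adding processors past the peak makes the system slower, which is exactly the behavior the modified model predicts for any monotonically increasing σ(p).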
Thus the research challenge in parallel processing involves finding algorithms, programming languages, and parallel architectures that, when used as a system, yield a large amount of work processed in parallel (large a) at the expense of a minimum number of additional instructions.
