Frontiers of Supercomputing
The Efficiency of Parallel Processing
Parallel processing, or the application of several processors to a single task, is an old idea with a relatively large literature. The advent of very large-scale integration technology has made testing the idea feasible, and the fact that single-processor systems are approaching their maximum performance level has made it necessary. We shall show, however, that successful use of parallel processing imposes stringent performance requirements on algorithms, software, and architecture.
The so-called asynchronous systems that use a few tightly coupled high-speed processors are a natural evolution from high-speed single-processor systems. Indeed, systems with two to four processors will soon be available (for example, the Cray X-MP, the Cray-2, and the Control Data System 2XX). Systems with eight to sixteen processors are likely by the early 1990s. What are the prospects of using the parallelism in such systems to achieve high speed in the execution of a single application? Early attempts with vector processing have shown that plunging forward without a precise understanding of the factors involved can lead to disastrous results. Such understanding will be even more critical for systems now contemplated that may use up to a thousand processors.
The key issue in the parallel processing of a single application is the speedup achieved, especially its dependence on the number of processors used. We define speedup (S) as the factor by which the execution time for the application changes; that is,

S = (execution time for one processor) / (execution time for p processors).
To estimate the speedup of a tightly coupled system on a single application, we use a model of parallel computation introduced by Ware. We define a as the fraction of work in the application that can be processed in parallel. Then we make a simplifying assumption of a two-state machine; that is, at any instant either all p processors are operating or only one processor is operating. If we normalize the execution time for one processor to unity, then

S(p, a) = 1 / [(1 - a) + a/p].

LOS ALAMOS SCIENCE Fall 1983

Note that the first term in the denominator is the execution time devoted to that part of the application that cannot be processed in parallel, and the second term is the time for that part that can be processed in parallel.
How does speedup vary with a? In particular, what is this relationship for a = 1, the ideal limit of complete parallelization? Differentiating S, we find that

dS/da = p(p - 1) / [p(1 - a) + a]^2,

which at a = 1 equals p(p - 1).
The accompanying figure shows the Ware model of speedup as a function of a for a 4-processor, an 8-processor, and a 16-processor system. The quadratic dependence of the derivative on p results in low speedup for a less than 0.9. Consequently, to achieve significant speedup, we must have highly parallel algorithms. It is by no means evident that algorithms in current use on single-processor machines contain the requisite parallelism, and research will be required to find suitable replacements for those that do not. Further, the highly parallel algorithms available must be implemented with care. For example, it is not sufficient to look at just those portions of the application amenable to parallelism because a is determined by the entire application. For a close to 1, changes in those few portions less amenable to parallelism will cause small changes in a, but the quadratic behavior of the derivative will translate those small changes in a into large changes in speedup.
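This sensitivity near a = 1 is easy to check numerically. The short Python sketch below (ours, not the article's) evaluates the Ware formula for a 16-processor system:

```python
def ware_speedup(p, a):
    # Ware model: serial fraction (1 - a) plus parallel fraction a spread over p
    return 1.0 / ((1.0 - a) + a / p)

# A drop of a few percent in a costs far more than a few percent of speedup:
for a in (1.0, 0.99, 0.95, 0.90):
    print(f"a = {a:.2f}  S(16, a) = {ware_speedup(16, a):.2f}")
```

Going from a = 1.00 to a = 0.95, a 5% change in a, cuts the speedup from 16 to about 9.1, a loss of over 40%.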
Those who have experience with vector processors will note a striking similarity between the Ware curves and plots of vector-processor performance versus the fraction of vectorizable computation. This similarity is due to the assumption in the Ware model of a two-state machine, since a vector processor can also be viewed in that manner. In one state it is a relatively slow, general-purpose machine, and in the other state it is capable of high performance on vector operations.
Ware's model is inadequate in that it assumes that the instruction stream executed on a parallel system is the same as that executed on a single processor. Seldom is this the case because multiple-processor systems usually require execution of instructions dealing with synchronization of the processes and communication between processors. Further, parallel algorithms may inherently require additional instructions. To correct for this inadequacy, we add a term, σ(p), representing the overhead of parallel implementation; this term is at best nonnegative and usually monotonically increasing with p, and it depends on the algorithm, the architecture, and even the application. Adding it to the denominator gives the modified model:

S(p, a) = 1 / [(1 - a) + a/p + σ(p)].

If the application can be put completely in parallel form, then

S(p, 1) = 1 / [1/p + σ(p)] < p.

In other words, the maximum speedup of a real system is less than the number of processors p, and it may be significantly less. Also note that, whatever the value of a, S will have a maximum for sufficiently large p if σ(p) continues to increase.

[Figure: Speedup as a function of parallelism (a) and number of processors.]
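The effect of a growing overhead term can be sketched numerically. In the Python example below, the overhead function and its coefficient are purely illustrative assumptions, not values from the article:

```python
def modified_speedup(p, a, sigma):
    """Ware model with an overhead term sigma(p) added to the denominator."""
    return 1.0 / ((1.0 - a) + a / p + sigma(p))

# Hypothetical overhead growing linearly with p (illustrative only):
sigma = lambda p: 0.001 * p

# Even with a = 1, speedup peaks at a finite p and then declines:
best_p = max(range(1, 201), key=lambda p: modified_speedup(p, 1.0, sigma))
print(best_p, round(modified_speedup(best_p, 1.0, sigma), 2))
```

With this overhead, adding processors past the peak makes the system slower, which is exactly the behavior the modified model predicts for any monotonically increasing σ(p).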
Thus the research challenge in parallel processing involves finding algorithms, programming languages, and parallel architectures that, when used as a system, yield a large amount of work processed in parallel (large a) at the expense of a minimum number of additional instructions.
