MCA 502 – Computer Architecture and Organization (Lecture Notes)

Author: Dr. Deepti Mehrotra
Vetter: Dr. Sandeep Arya
Lesson No. 01: Parallel Computer Models
1.1 Objective
1.2 Introduction
1.3 The state of computing
1.3.1 Evolution of computer systems
1.3.2 Elements of Modern Computers
1.3.3 Flynn's Classical Taxonomy
1.3.4 System attributes
1.4 Multiprocessor and multicomputers
1.4.1 Shared memory multiprocessors
1.4.2 Distributed Memory Multiprocessors
1.4.3 A taxonomy of MIMD Computers
1.5 Multivector and SIMD computers
1.5.1 Vector Supercomputer
1.5.2 SIMD supercomputers
1.6 PRAM and VLSI model
1.6.1 Parallel Random Access machines
1.6.2 VLSI Complexity Model
1.7 Keywords
1.8 Summary
1.9 Exercises
1.10 References
1.1 Objective
The main aim of this chapter is to learn about the evolution of computer systems, the attributes on which the performance of a system is measured, the classification of computers by their ability to perform multiprocessing, and the trends towards parallel processing.
1.2 Introduction
From an application point of view, mainstream computer usage is moving through four ascending levels of sophistication:

  • Data processing
  • Information processing
  • Knowledge processing
  • Intelligence processing

With more and more data structures being developed, many users are shifting their computer use from pure data processing to information processing. A high degree of parallelism has been found at these levels. As accumulated knowledge bases expanded rapidly in recent years, a strong demand grew to use computers for knowledge processing. Intelligence is very difficult to create, and its processing even more so. Today's computers are very fast and obedient and have many reliable memory cells, qualifying them for data, information and knowledge processing. Parallel processing is emerging as one of the key technologies of modern computing. Parallelism appears in various forms such as lookahead, vectorization, concurrency, simultaneity, data parallelism, interleaving, overlapping, multiplicity, replication, multiprogramming, multithreading and distributed computing at different processing levels.

1.3 The state of computing
Modern computers are equipped with powerful hardware and, at the same time, loaded with sophisticated software packages. To assess the state of computing we first review the history of computers and then study the attributes used to analyse their performance.

1.3.1 Evolution of computer systems
The technology behind the hardware components of computers, and their overall architecture, is changing very rapidly: processor clock rates increase about 20% a year, logic capacity improves about 30% a year, memory speed increases about 10% a year, memory capacity grows about 60% a year, disk capacity grows about 60% a year, and so the overall cost per bit improves about 25% a year. But before we go further into the design and organization of parallel computer architectures, it is necessary to understand how computers evolved. Initially, man used simple mechanical devices – the abacus (about 500 BC), knotted string, and the slide rule for

implementations. Designers always tried to make each new machine upward compatible with the older machines.

  1. The concept of specialized registers was introduced: index registers appeared in the Ferranti Mark I, a register to save the return address was introduced in the UNIVAC I, immediate operands in the IBM 704, and the detection of invalid operations in the IBM 650.
  2. Punched cards and paper tape were the devices used at that time for storing programs. By the end of the 1950s the IBM 650 had become one of the popular computers of its time; it used drum memory onto which programs were loaded from punched cards or paper tape. Some high-end machines also introduced core memory, which provided higher speeds, and hard disks started becoming popular.
  3. As said earlier, machines of the early 1950s were design-specific, and most of them were built for particular numerical processing tasks. Many even used decimal numbers as the base number system of their instruction sets; in such machines there were actually ten vacuum tubes per digit in each register.
  4. The software used was machine-level language and assembly language.
  5. They were mostly designed for scientific calculation; later, some systems were developed for simple business applications.
  6. Architecture features: vacuum tubes and relay memories; CPU driven by a program counter (PC) and accumulator; fixed-point arithmetic only.
  7. Software and applications: machine and assembly language; single user at a time; no subroutine linkage mechanisms; programmed I/O requiring continuous use of the CPU.
  8. Examples: ENIAC, Princeton IAS, IBM 701.

IInd generation of computers (1954–64)
The transistor was invented by Bardeen, Brattain and Shockley in 1947 at Bell Labs, and by the 1950s transistors had started an electronic revolution: a transistor is smaller, cheaper and dissipates less heat than a vacuum tube. Transistors were now used instead of vacuum tubes to construct computers. Another major invention was the magnetic core for storage; these cores were used to build large random-access memories. Computers of this generation had better processing speed, larger memory capacity and smaller size than the previous generation. The key features of this generation were:

  1. IInd generation computers were designed using germanium transistors, a technology much more reliable than vacuum tubes.
  2. Transistor technology reduced switching times to 1–10 microseconds, providing an overall speedup.
  3. Magnetic cores were used as main memory, with capacities of about 100 KB; tapes and disks were used as secondary (peripheral) memory.
  4. The concept of an instruction set was introduced, so that the same program could be executed on different systems.
  5. High-level languages (FORTRAN, COBOL, ALGOL) and batch operating systems appeared.
  6. Computers were now used for extensive business applications, engineering design, optimization using linear programming, and scientific research.
  7. The binary number system was widely used.
  8. Technology and architecture: discrete transistors and core memories; I/O processors; multiplexed memory access; floating-point arithmetic available; Register Transfer Language (RTL) developed.
  9. Software and applications: high-level languages (FORTRAN, COBOL, ALGOL) with compilers and subroutine libraries; batch operating systems, although mostly single user at a time.
  10. Examples: CDC 1604, UNIVAC LARC, IBM 7090.

IIIrd Generation computers (1965 to 1974)
In the 1950s and 1960s, discrete components (transistors, resistors, capacitors) were manufactured and packaged in separate containers. To design a computer, these discrete

  1. Software and applications: multiprogramming and time-sharing operating systems; multi-user applications.
  2. Examples: IBM 360/370, CDC 6600, TI ASC, DEC PDP-

IVth Generation computers (1975 to 1990)
The microprocessor was invented: a single VLSI (Very Large Scale Integration) chip containing an entire CPU. Main-memory chips of 1 MB and more were introduced as single VLSI chips. Caches were invented and placed between the main memory and the microprocessor. These VLSI devices greatly reduced the space required by a computer and significantly increased its computational speed.

  1. Technology and architecture features: LSI/VLSI circuits, semiconductor memory; multiprocessors, vector supercomputers, multicomputers; shared or distributed memory; vector processors.
  2. Software and applications: multiprocessor operating systems, languages, compilers and parallel software tools.
  3. Examples: VAX 9000, Cray X-MP, IBM 3090, BBN TC.

Fifth Generation computers (1990 onwards)
In the mid-to-late 1980s, in order to further improve system performance, designers started using a technique known as "instruction pipelining". The idea is to break instruction execution into small steps so that the processor works on several instructions in different stages of completion; for example, while computing the result of the current instruction, the processor also fetches the operands for the next instruction. Superscalar processors were later designed on this basis: to execute multiple instructions in parallel they provide multiple execution units, i.e. separate arithmetic-logic units (ALUs). Instead of executing a single instruction at a time, the CPU looks for several instructions that are not dependent on each other and executes them in parallel. Examples of this design are VLIW and EPIC.
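To see why pipelining pays off, a rough timing comparison helps (this sketch is added for illustration and is not part of the original notes; the instruction count n and stage count k are assumed values). An unpipelined processor needs about n*k stage-times for n instructions, whereas an ideal k-stage pipeline needs about k + (n - 1) cycles, since one instruction completes per cycle once the pipe is full:

```c
#include <stdio.h>

/* Classical timing estimate for an ideal k-stage pipeline (illustrative only):
 * unpipelined: every instruction occupies all k stage-times sequentially;
 * pipelined:   after the first instruction fills the pipe, one instruction
 *              completes per cycle (no stalls or hazards assumed).          */
int main(void) {
    const long n = 1000;   /* instructions (assumed workload) */
    const int  k = 5;      /* pipeline stages (assumed depth) */

    long unpipelined = n * k;        /* cycles without pipelining  */
    long pipelined   = k + (n - 1);  /* cycles with ideal pipeline */

    printf("unpipelined: %ld cycles\n", unpipelined);
    printf("pipelined:   %ld cycles\n", pipelined);
    printf("speedup:     %.2f\n", (double)unpipelined / pipelined);
    return 0;
}
```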

  1. Technology and architecture features: ULSI/VHSIC processors, memories and switches; high-density packaging; scalable architectures; vector processors.
  2. Software and applications: massively parallel processing; grand-challenge applications; heterogeneous processing.
  3. Examples: Fujitsu VPP500, Cray MPP, TMC CM-5, Intel Paragon.

1.3.2 Elements of Modern Computers
The hardware, software and programming elements of modern computer systems can be characterized by looking at a variety of factors. In the context of parallel computing these factors are:
  • Computing problems
  • Algorithms and data structures
  • Hardware resources
  • Operating systems
  • System software support
  • Compiler support

Computing Problems
  • Numerical computing: complex mathematical formulations; tedious integer or floating-point computation
  • Transaction processing: accurate transactions; large database management; information retrieval
  • Logical reasoning: logic inference; symbolic manipulation
  • Parallel software can be developed using entirely new languages designed specifically with parallel support as their goal, or by using extensions to existing sequential languages.
  • New languages have obvious advantages (like new constructs specifically for parallelism), but require additional programmer education and system software.
  • The most common approach is to extend an existing language.

Compiler Support
  • Preprocessors: use existing sequential compilers and specialized libraries to implement parallel constructs.
  • Precompilers: perform some program flow analysis, dependence checking, and limited parallel optimizations.
  • Parallelizing compilers: require full detection of parallelism in source code and transformation of sequential code into parallel constructs.
  • Compiler directives: often inserted into source code to aid the compiler's parallelizing efforts.

1.3.3 Flynn's Classical Taxonomy
Among the classification schemes mentioned above, the one widely used since 1966 is Flynn's taxonomy. It distinguishes multiprocessor computer architectures along two independent dimensions: the instruction stream and the data stream. An instruction stream is the sequence of instructions executed by the machine, and a data stream is the sequence of data (including input and partial or temporary results) used by the instruction stream. Each of these dimensions can be in only one of two states: single or multiple. Flynn's classification depends on the distinction between the activity of the control unit and that of the data-processing unit, rather than on their operational and structural interconnections. The four categories of Flynn's classification and the characteristic features of each are described below.

a) Single instruction stream, single data stream (SISD)

Figure 1.1 Execution of instructions in SISD processors

Figure 1.1 represents the organization of a simple SISD computer having one control unit, one processor unit and a single memory unit.

Figure 1.2 SISD processor organization

  • They are also called scalar processors, i.e. they execute one instruction at a time, and each instruction has only one set of operands.
  • Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
  • Single data: only one data stream is being used as input during any one clock cycle
  • Deterministic execution
  • Instructions are executed sequentially.
  • This is the oldest and, until recently, the most prevalent form of computer.
  • Examples: most PCs, single-CPU workstations and mainframes.

b) Single instruction stream, multiple data stream (SIMD) processors
  • A type of parallel computer
  • Single instruction: all processing units execute the same instruction, issued by the control unit, at any given clock cycle; as shown in the SIMD organization figure, multiple processors execute the instruction given by one control unit.
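To make the SISD/SIMD distinction concrete, here is a small illustrative sketch in C (added here, not taken from the notes). The first loop is scalar: one add instruction per pair of operands. The second uses x86 SSE intrinsics, so a single ADDPS instruction operates on four floats at once; it assumes an SSE-capable processor and compiler.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: one instruction, four floats */

#define N 8

int main(void) {
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N], d[N];

    /* SISD style: one instruction operates on one pair of operands at a time */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* SIMD style: one ADDPS instruction operates on four pairs of operands */
    for (int i = 0; i < N; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&d[i], _mm_add_ps(va, vb));
    }

    for (int i = 0; i < N; i++)
        printf("%g %g\n", c[i], d[i]);   /* both columns agree */
    return 0;
}
```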

c) Multiple instruction stream, single data stream (MISD)

  • A single data stream is fed into multiple processing units.
  • Each processing unit operates on the data independently via an independent instruction stream. As shown in figure 1.5, a single data stream is forwarded to different processing units, each connected to a different control unit and executing the instructions issued by the control unit to which it is attached.

Figure 1.5 MISD processor organization

  • Thus in these computers the same data flows through a linear array of processors executing different instruction streams, as shown in figure 1.6.
  • This architecture is also known as a systolic array, for pipelined execution of specific instructions.
  • Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).
  • Some conceivable uses might be:
  1. multiple frequency filters operating on a single signal stream
  2. multiple cryptography algorithms attempting to crack a single coded message.

Figure 1.6 Execution of instructions in MISD processors
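The "multiple filters on a single signal stream" idea can be sketched loosely in C (an illustration added here; the three filter routines are hypothetical). In a real MISD organization each routine would run on its own processing unit against the same data stream; here they are simply called one after another:

```c
#include <stdio.h>

/* Three independent "instruction streams" applied to the same data stream.
 * In a real MISD machine each routine would run on a separate processing
 * unit; here they are invoked sequentially purely for illustration.       */
static double scale(double x)  { return 0.5 * x; }   /* hypothetical filter */
static double square(double x) { return x * x;   }   /* hypothetical filter */
static double negate(double x) { return -x;      }   /* hypothetical filter */

int main(void) {
    double stream[] = {1.0, 2.0, 3.0, 4.0};          /* the single data stream */
    for (int i = 0; i < 4; i++) {
        double x = stream[i];
        printf("%4.1f -> %6.2f %6.2f %6.2f\n", x, scale(x), square(x), negate(x));
    }
    return 0;
}
```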

d) Multiple instruction stream, multiple data stream (MIMD)

  • Multiple Instruction: every processor may be executing a different instruction stream
  • Multiple data: every processor may be working with a different data stream; as shown in figure 1.7, the multiple data streams are provided by shared memory.
  • Can be categorized as loosely coupled or tightly coupled depending on sharing of data and control
  • Execution can be synchronous or asynchronous, deterministic or non-deterministic.

Figure 1.7 MIMD processor organizations

  • As shown in figure 1.8, there are different processors, each processing a different task.
  • Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers - including some types of PCs.

Figure 1.8 Execution of instructions in MIMD processors
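A minimal software sketch of the MIMD idea, using POSIX threads (added for illustration, not part of the notes): the two threads below execute different instruction streams on different data at the same time. Compile with -pthread.

```c
#include <pthread.h>
#include <stdio.h>

/* Thread 1: one instruction stream, sums an integer array */
static void *sum_task(void *arg) {
    int *data = arg, total = 0;
    for (int i = 0; i < 4; i++) total += data[i];
    printf("sum = %d\n", total);
    return NULL;
}

/* Thread 2: a different instruction stream, scales a float array */
static void *scale_task(void *arg) {
    float *data = arg;
    for (int i = 0; i < 4; i++) data[i] *= 2.0f;
    printf("scaled last = %f\n", data[3]);
    return NULL;
}

int main(void) {
    int   ints[4]   = {1, 2, 3, 4};
    float floats[4] = {1.5f, 2.5f, 3.5f, 4.5f};
    pthread_t t1, t2;

    /* Different programs, different data: the essence of MIMD */
    pthread_create(&t1, NULL, sum_task, ints);
    pthread_create(&t2, NULL, scale_task, floats);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```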

benchmarks were developed. Computer architects have come up with a variety of metrics to describe computer performance.

Clock rate and CPI/IPC: Since I/O and system overhead frequently overlap the processing of other programs, it is fair to consider only the CPU time used by a program, and the user CPU time is the most important factor. The CPU is driven by a clock with a constant cycle time τ (usually measured in nanoseconds), which controls the rate of internal operations in the CPU. The inverse of the cycle time is the clock rate (f = 1/τ, measured in megahertz). A shorter clock cycle time, or equivalently a larger number of cycles per second, implies that more operations can be performed per unit time. The size of a program is determined by its instruction count Ic, the number of machine instructions to be executed by the program. Different machine instructions require different numbers of clock cycles to execute, so CPI (cycles per instruction) is an important parameter.

Average CPI: It is easy to determine the average number of cycles per instruction for a particular processor if we know the frequency of occurrence of each instruction type. Of course, any estimate is valid only for a specific set of programs (which defines the instruction mix), and then only if there is a sufficiently large number of instructions. In general, the term CPI is used with respect to a particular instruction set and a given program mix. The time required to execute a program containing Ic instructions is T = Ic * CPI * τ.

Each instruction must be fetched from memory and decoded, then its operands fetched from memory, the instruction executed, and the results stored. The time required to access memory is called the memory cycle time, which is usually k times the processor cycle time τ; the value of k depends on the memory technology and the processor–memory interconnection scheme. The processor cycles required for each instruction (CPI) can be attributed to cycles needed for instruction decode and execution (p) and cycles needed for memory references (m * k). The total time needed to execute a program can then be rewritten as T = Ic * (p + m * k) * τ.
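As a small worked example of T = Ic * (p + m * k) * τ, the numbers below are assumed values chosen only for illustration, not measurements from the text:

```c
#include <stdio.h>

int main(void) {
    /* Assumed, illustrative values -- not figures from the notes */
    double Ic  = 2.0e9;   /* instructions executed by the program        */
    double p   = 1.5;     /* cycles for decode and execution, on average */
    double m   = 0.4;     /* memory references per instruction           */
    double k   = 4.0;     /* memory cycle time / processor cycle time    */
    double tau = 1.0e-9;  /* processor cycle time: 1 ns (1 GHz clock)    */

    double cpi = p + m * k;       /* effective cycles per instruction      */
    double T   = Ic * cpi * tau;  /* total CPU time, T = Ic*(p + m*k)*tau  */

    printf("CPI = %.2f\n", cpi);        /* 3.10            */
    printf("CPU time = %.2f s\n", T);   /* about 6.2 s     */
    return 0;
}
```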

MIPS: Millions of instructions per second; this is calculated by dividing the number of instructions executed in a running program by the time required to run the program. The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI. All four system attributes (instruction set, compiler, processor, and memory technologies) affect the MIPS rate, which also varies from program to program. MIPS has not proved very effective, because it does not account for the fact that different systems often require different numbers of instructions to implement the same program; it says nothing about how many instructions are required to perform a given task. With variations in instruction styles, internal organization and number of processors per system, it is almost meaningless for comparing two systems.

MFLOPS (pronounced "megaflops") stands for millions of floating-point operations per second. This is often used as a "bottom-line" figure. If one knows ahead of time how many operations a program needs to perform, one can divide the number of operations by the execution time to come up with a MFLOPS rating. For example, the standard algorithm for multiplying two n x n matrices requires 2n^3 - n operations (n^2 inner products, with n multiplications and n - 1 additions in each product). Suppose you compute the product of two 100 x 100 matrices in 0.35 seconds. Then the computer achieves (2(100)^3 - 100)/0.35 = 5,714,000 ops/sec = 5.714 MFLOPS.

The term "theoretical peak MFLOPS" refers to how many operations per second would be possible if the machine did nothing but numerical operations. It is obtained by calculating the time it takes to perform one operation and then computing how many of them could be done in one second. For example, if it takes 8 cycles to do one floating-point multiplication, the cycle time on the machine is 20 nanoseconds, and arithmetic operations are not overlapped with one another, it takes 160 ns for one multiplication, and (1,000,000,000 ns / 1 sec) * (1 multiplication / 160 ns) = 6.25 * 10^6 multiplications/sec, so the theoretical peak performance is 6.25 MFLOPS. Of course, programs are not just long sequences of multiply and add instructions, so a machine rarely comes close to this level of performance on any real program. Most machines will achieve less than 10% of their peak rating, but vector processors or other machines with internal pipelines that have an effective CPI near 1.0 can often achieve 70% or more of their theoretical peak on small programs.
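The matrix-multiplication example above can be reproduced in a few lines of C; the 0.35-second timing and the 8-cycle/20-ns peak figures are taken from the text, and everything else follows from the 2n^3 - n operation count:

```c
#include <stdio.h>

int main(void) {
    double n       = 100.0;                 /* matrix dimension               */
    double seconds = 0.35;                  /* measured run time (from text)  */
    double ops     = 2.0 * n * n * n - n;   /* 2n^3 - n floating-point ops    */

    double mflops = ops / seconds / 1.0e6;  /* millions of FP ops per second  */
    printf("%.0f operations in %.2f s -> %.3f MFLOPS\n", ops, seconds, mflops);

    /* Theoretical peak from the second example: 8 cycles per multiply,
       20 ns cycle time, no overlap -> 160 ns per operation               */
    double peak = 1.0e9 / (8 * 20.0) / 1.0e6;
    printf("theoretical peak = %.2f MFLOPS\n", peak);
    return 0;
}
```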

  • compiler technology (affects Ic, p and m)
  • CPU implementation and control (affects p·τ)
  • cache and memory hierarchy (affects the memory access latency, k·τ)
  • Total CPU time can be used as a basis for estimating the execution rate of a processor.

Programming Environments
Programmability depends on the programming environment provided to the user. Conventional computers are used in a sequential programming environment, with tools developed for uniprocessor computers. Parallel computers need parallel tools that allow specification or easy detection of parallelism, and operating systems that can perform parallel scheduling of concurrent events, shared memory allocation, and shared peripheral and communication links.

Implicit Parallelism
Use a conventional language (such as C, Fortran, Lisp or Pascal) to write the program, and a parallelizing compiler to translate the source code into parallel code. The compiler must detect the parallelism and assign target machine resources. Success relies heavily on the quality of the compiler.

Explicit Parallelism
The programmer writes explicit parallel code using parallel dialects of common languages. The compiler has less need to detect parallelism, but must still preserve the existing parallelism and assign target machine resources (a small sketch of explicit parallelism follows the list of needed software tools below).

Needed Software Tools
  • Parallel extensions of conventional high-level languages.
  • Integrated environments providing different levels of program abstraction; validation, testing and debugging; performance prediction and monitoring; and visualization support to aid program development, performance measurement, and graphics display and animation of computational results.
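A minimal sketch of explicit parallelism, using OpenMP as the parallel extension of C (OpenMP is an assumption of this example, not something named in the notes): the programmer states directly that the loop iterations are independent, and the compiler and runtime assign them to processors. Compile with -fopenmp.

```c
#include <stdio.h>
#include <omp.h>          /* OpenMP runtime library */

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Explicit parallelism: the directive tells the compiler these
       iterations are independent and may be split across threads.  */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f (up to %d threads)\n", c[N - 1], omp_get_max_threads());
    return 0;
}
```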
1.4 Multiprocessor and Multicomputers
Two categories of parallel computers are discussed below: those with shared common memory and those with unshared distributed memory.

1.4.1 Shared memory multiprocessors
  • Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as a global address space.
  • Multiple processors can operate independently but share the same memory resources.
  • Changes in a memory location effected by one processor are visible to all other processors.
  • Shared memory machines can be divided into classes based upon memory access times: UMA, NUMA and COMA.

Uniform Memory Access (UMA):

  • Most commonly represented today by Symmetric Multiprocessor (SMP) machines
  • Identical processors
  • Equal access and access times to memory
  • Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.

Figure 1.9 Shared Memory (UMA)
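A rough software analogue of the shared-memory model (a sketch added here, not from the notes): the threads below stand in for processors that address the same global memory location, so an update made by one is visible to all; the mutex plays the role of the hardware's synchronization support. Compile with -pthread.

```c
#include <pthread.h>
#include <stdio.h>

/* One location in the global address space, visible to every thread */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* serialize updates to the shared data */
        counter++;                    /* change is visible to all threads     */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000: every update was seen */
    return 0;
}
```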