Calculation of Cache Memory Parameters, Lecture notes of Advanced Computer Architecture

Calculations for various cache memory parameters such as number of bits in physical address, block offset, line number, tag, cache size, main memory size, and tag directory size.

5th Semester
Advanced Computer Architecture
Objectives
To understand the advanced hardware and software issues of computer architecture
To understand the multi-processor architecture & connection mechanism
To understand multi-processor memory management
Module-I: (10 Hours)
Microprocessor and Microcontroller, RISC and CISC architectures, Parallelism, Pipelining fundamentals,
Arithmetic and Instruction pipelining, Pipeline Hazards, Superscalar Architecture, Super Pipelined
Architecture, VLIW Architecture, SPARC and ARM processors.
Module-II: (10 Hours)
Basic Multiprocessor Architecture: Flynn’s Classification, UMA, NUMA, Distributed Memory
Architecture, Array Processor, Vector Processors.
Module-III: (10 Hours)
Interconnection Networks: Static Networks, Network Topologies, Dynamic Networks, Cloud computing.
Module-IV: (10 Hours)
Memory Technology: Cache, Cache memory mapping policies, Cache updating schemes, Virtual
memory, Page replacement techniques, I/O subsystems.
Outcomes
Ability to analyze the abstraction of various advanced architectures of a computer
Ability to analyze the multi-processor architecture & connection mechanism
Ability to work out the tradeoffs involved in designing a modern computer system
Books:
[1] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan
Kaufmann, 6th edition, 2017
[2] Carl Hamacher, Zvonko Vranesic, Safwat Zaky, Computer Organization, McGraw Hill, 5th Ed, 2014
[3] Kai Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability,
McGraw-Hill, 3rd Ed, 2015
Digital Learning Resources:
Course Name: Advanced Computer Architecture
Course Link: https://nptel.ac.in/courses/106/103/106103206/
Course Instructor: Prof. John Jose, IIT Guwahati
Course Name: High Performance Computer Architecture
Course Link: https://nptel.ac.in/courses/106/105/106105033/
Course Instructor: Prof. A. Pal, IIT Kharagpur


Introduction to microprocessor and microcontroller

  • A microprocessor is an IC that contains only a CPU, i.e. only the processing power, such as Intel's Pentium 1/2/3/4, Core 2 Duo, i3, i5, etc. These microprocessors have no RAM, ROM, or other peripherals on the chip; a system designer has to add them externally to build a working system.
  • Applications of microprocessors include desktop PCs, laptops, notepads, etc.
  • A microcontroller has a CPU along with a fixed amount of RAM, ROM, and other peripherals, all embedded on a single chip. It is sometimes called a mini computer or a computer on a chip. Many manufacturers produce microcontrollers with a wide range of features in different versions, e.g. Atmel, Microchip, TI, Freescale, Philips, Motorola.
  • Microcontrollers are designed to perform specific tasks: applications where the relationship between input and output is defined. Depending on the input, some processing is done and an output is delivered.
  • Examples include keyboards, mice, washing machines, digital cameras, pen drives, remote controls, microwaves, cars, bikes, telephones, mobiles, watches, etc. Since the applications are very specific, they need only small resources such as RAM, ROM, and I/O ports, and hence can be embedded on a single chip. This in turn reduces both size and cost.
  • Microprocessors find application where tasks are unspecific, such as developing software, games, and websites, photo editing, or creating documents. In such cases the relationship between input and output is not defined, and a large amount of resources (RAM, ROM, I/O ports) is needed.
  • The clock speed of a microprocessor is quite high compared to that of a microcontroller: microcontrollers operate from a few MHz up to 30-50 MHz, while today's microprocessors operate above 1 GHz, as they perform complex tasks.
  • A microcontroller is generally far cheaper than a microprocessor. However, a microcontroller cannot be used in place of a microprocessor, and using a microprocessor in place of a microcontroller is not advised, as it makes the application quite costly.
  • A microprocessor cannot be used standalone; it needs other peripherals such as RAM, ROM, buffers, and I/O ports, so a system designed around a microprocessor is quite costly.

Evolution of Microprocessors

  • The transistor was invented on 23 December 1947 at Bell Labs.
  • The IC was invented in 1958 by Jack Kilby at Texas Instruments (Fairchild Semiconductor followed with the first practical monolithic IC).
  • The first microprocessor was invented by INTEL (INTegrated ELectronics).

4-bit microprocessors:

  Name        Year of Invention                     Clock Speed  Transistors  Inst. per Sec
  INTEL 4004  1971 (by Ted Hoff and Stanley Mazor)  740 kHz      2,300        60,000

8-bit microprocessors:

  Name  Year of Invention          Clock Speed  Transistors  Inst. per Sec
  8008  1972                       500 kHz      —            50,000
  8080  1974                       2 MHz        60,000       10 times faster than 8008
  8085  1976 (16-bit address bus)  3 MHz        6,500        769,230

Types of microprocessors:

Complex instruction set microprocessor (CISC): These processors are designed to minimise the number of instructions per program, ignoring the number of cycles per instruction. The compiler translates a high-level language to assembly; because the code is relatively short, extra RAM is used to store the instructions. These processors can perform tasks like downloading, uploading, and recalling data from memory, and can also perform complex mathematical calculations in a single command. Examples: IBM 370/168, VAX 11/780.

Reduced instruction set microprocessor (RISC): These processors are designed to reduce execution time by using a simplified instruction set. They carry out small tasks in specific commands and complete commands at a faster rate, requiring only one clock cycle per result, with uniform execution time. They have a larger number of registers and fewer transistors. LOAD and STORE instructions are used to access memory. Examples: PowerPC 601, 604, 615, 620.

Superscalar microprocessor: These processors can perform many tasks at a time. They have multiple operation units, such as ALUs and multiplier arrays, and execute multiple commands in parallel.

Application-specific integrated circuit (ASIC): These processors are application-specific, e.g. for personal digital assistant computers, and are designed to an exact specification.

Digital signal processor (DSP): These processors are used to convert signals from analog to digital or from digital to analog. Their chips are used in many devices such as RADAR, SONAR, and home theatres.

Advantages of microprocessors:

  • High processing speed
  • Compact size
  • Easy maintenance
  • Can perform complex mathematics
  • Flexible
  • Can be improved according to requirement

Disadvantages of microprocessors:
  • Overheating occurs due to overuse
  • Performance depends on size of data
  • Large board size than microcontrollers
  • Most microprocessors do not support floating point operations

Pipelining

To improve the performance of a CPU we have two options:

  1. Improve the hardware by introducing faster circuits.
  2. Arrange the hardware such that more than one operation can be performed at the same time.

Since there is a limit on the speed of hardware and the cost of faster circuits is quite high, we have to adopt the second option.

Pipelining: Pipelining is an arrangement of the hardware elements of the CPU such that its overall performance is increased. In a pipelined processor, more than one instruction executes simultaneously.

Example: Consider a water bottle packaging plant. Let there be 3 stages that a bottle must pass through: Inserting the bottle (I), Filling water in the bottle (F), and Sealing the bottle (S), referred to as stage 1, stage 2, and stage 3. Let each stage take 1 minute to complete its operation.

In a non-pipelined operation, a bottle is first inserted into the plant; after 1 minute it moves to stage 2, where water is filled, while nothing happens in stage 1. Similarly, when the bottle moves to stage 3, both stage 1 and stage 2 are idle. In a pipelined operation, when the bottle is in stage 2, another bottle can be loaded at stage 1; when the bottle is in stage 3, there is one bottle each in stage 1 and stage 2. So after each minute, a new bottle comes out of stage 3. The average time taken to produce 1 bottle is:

Without pipelining = 9/3 minutes = 3 minutes per bottle

  I F S . . . . . .
  . . . I F S . . .
  . . . . . . I F S    (9 minutes)

With pipelining = 5/3 minutes ≈ 1.67 minutes per bottle

  I F S . .
  . I F S .
  . . I F S    (5 minutes)

Thus, pipelined operation increases the efficiency of a system.

Design of a basic pipeline:
  • In a pipelined processor, a pipeline has two ends: the input end and the output end. Between these ends there are multiple stages/segments, such that the output of one stage is connected to the input of the next stage, and each stage performs a specific operation.
  • Interface registers, also called latches or buffers, are used to hold the intermediate output between two stages.
  • All the stages in the pipeline, along with the interface registers, are controlled by a common clock.

Execution in a pipelined processor: The execution sequence of instructions in a pipelined processor can be visualized using a space-time diagram. For example, consider a processor having 4 stages and 2 instructions to be executed.

Non-overlapped execution:

  STAGE / CYCLE   1    2    3    4    5    6    7    8
  S1              I1                  I2
  S2                   I1                  I2
  S3                        I1                  I2
  S4                             I1                  I2

  Total time = 8 cycles

Overlapped (pipelined) execution:

  STAGE / CYCLE   1    2    3    4    5
  S1              I1   I2
  S2                   I1   I2
  S3                        I1   I2
  S4                             I1   I2

  Total time = 5 cycles
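The bottle-plant arithmetic generalizes to n tasks on a k-stage pipeline with per-stage time Tp. A minimal sketch in Python (the function names are mine, not from the notes):

```python
# Sketch: completion time with and without pipelining, assuming uniform
# stage delays and no stalls.

def non_pipelined_time(k: int, n: int, tp: float) -> float:
    # Each task passes through all k stages before the next task starts.
    return n * k * tp

def pipelined_time(k: int, n: int, tp: float) -> float:
    # The first task takes k stage-times; each later task finishes one
    # stage-time after the previous one.
    return (k + n - 1) * tp

# Bottle plant: k = 3 stages (I, F, S), n = 3 bottles, 1 minute per stage.
print(non_pipelined_time(3, 3, 1))  # 9 minutes -> 3 minutes per bottle
print(pipelined_time(3, 3, 1))      # 5 minutes -> ~1.67 minutes per bottle
```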

Throughput = Number of instructions / Total time to complete the instructions

So, Throughput = n / [(k + n – 1) * Tp]

Note: The cycles per instruction (CPI) value of an ideal pipelined processor is 1.

Problem (example): Consider a pipeline having 4 phases with durations 60, 50, 90 and 80 ns. The latch delay is 10 ns. Calculate-

  1. Pipeline cycle time
  2. Non-pipeline execution time
  3. Speed up ratio
  4. Pipeline time for 1000 tasks
  5. Sequential time for 1000 tasks
  6. Throughput

Solution- Given-
  • A four-stage pipeline is used, so k = 4
  • Delay of stages = 60, 50, 90 and 80 ns
  • Latch delay (delay due to each register) = 10 ns

1: Pipeline Cycle Time-
  Cycle time = Maximum delay due to any stage + Delay due to its register
             = Max {60, 50, 90, 80} + 10 ns = 90 ns + 10 ns = 100 ns

2: Non-Pipeline Execution Time- (no latches, hence latch delay = 0)
  Non-pipeline execution time for one instruction = 60 ns + 50 ns + 90 ns + 80 ns = 280 ns

3: Speed Up Ratio-
  Speed up = Non-pipeline execution time / Pipeline execution time = 280 ns / 100 ns = 2.8

4: Pipeline Time For 1000 Tasks-
  Pipeline time for 1000 tasks = Time taken for 1st task + Time taken for remaining 999 tasks
    = 1 x 4 clock cycles + 999 x 1 clock cycle
    = 4 x cycle time + 999 x cycle time
    = 4 x 100 ns + 999 x 100 ns = 400 ns + 99900 ns = 100300 ns

5: Sequential Time For 1000 Tasks-
  Non-pipeline time for 1000 tasks = 1000 x Time taken for one task = 1000 x 280 ns = 280000 ns

6: Throughput-
  Throughput for pipelined execution = Number of instructions executed per unit time
    = 1000 tasks / 100300 ns ≈ 9.97 x 10^6 tasks per second
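The same six steps can be checked with a few lines of Python; this is just the arithmetic above, with variable names of my own choosing:

```python
# Sketch: the worked 4-stage pipeline example, computed step by step.
stage_delays = [60, 50, 90, 80]   # ns
latch_delay = 10                  # ns
n = 1000                          # number of tasks
k = len(stage_delays)             # number of stages

cycle_time = max(stage_delays) + latch_delay        # 100 ns
non_pipelined_one = sum(stage_delays)               # 280 ns (no latches)
speedup = non_pipelined_one / cycle_time            # 2.8
pipelined_total = (k + n - 1) * cycle_time          # 100300 ns
sequential_total = n * non_pipelined_one            # 280000 ns
throughput = n / (pipelined_total * 1e-9)           # ~9.97e6 tasks per second

print(cycle_time, non_pipelined_one, speedup)
print(pipelined_total, sequential_total, throughput)
```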

Dependencies in a pipelined processor

Pipeline hazards

There are mainly three types of dependencies possible in a pipelined processor. These are:

  1. Structural Dependency
  2. Control Dependency
  3. Data Dependency

These dependencies may introduce stalls in the pipeline.

Stall: A stall is a cycle in the pipeline without new input.

1. Structural dependency

This dependency arises due to resource conflict in the pipeline. A resource conflict is a situation in which more than one instruction tries to access the same resource in the same cycle. A resource can be a register, memory, or ALU.

Example:

  INSTRUCTION / CYCLE   1        2        3        4        5
  I1                    IF(Mem)  ID       EX       Mem
  I2                             IF(Mem)  ID       EX
  I3                                      IF(Mem)  ID       EX
  I4                                               IF(Mem)  ID

  • In the above scenario, in cycle 4, instructions I1 and I4 are trying to access the same resource (memory), which introduces a resource conflict.
  • To avoid this problem, we have to make the instruction wait until the required resource (memory, in our case) becomes available. This wait introduces stalls in the pipeline, as shown below:

  CYCLE   1        2        3        4        5    6    7        8
  I1      IF(Mem)  ID       EX       Mem      WB
  I2               IF(Mem)  ID       EX       Mem  WB
  I3                        IF(Mem)  ID       EX   Mem  WB
  I4                                 –        –    –    IF(Mem)  ID

Solution for structural dependency: To minimize structural dependency stalls in the pipeline, we use a hardware mechanism called renaming. According to renaming, we divide the memory into two independent modules, called Code memory (CM) and Data memory (DM), used to store instructions and data separately. CM contains all the instructions and DM contains all the operands required by the instructions.

  INSTRUCTION / CYCLE   1       2       3       4       5       6       7
  I1                    IF(CM)  ID      EX      DM      WB
  I2                            IF(CM)  ID      EX      DM      WB
  I3                                    IF(CM)  ID      EX      DM      WB
  I4                                            IF(CM)  ID      EX      DM
  I5                                                    IF(CM)  ID      EX
  I6                                                            IF(CM)  ID
  I7                                                                    IF(CM)
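The stall pattern in the tables above can be reproduced by modelling the single memory port that IF and Mem compete for. A rough simulation sketch, under the simplifying assumptions (mine, not from the notes) that Mem always wins the port and an instruction's Mem stage comes exactly three cycles after its IF:

```python
# Sketch: fetch scheduling with one memory port shared by IF and Mem.
# Assumes a 5-stage pipeline (IF ID EX Mem WB) and that Mem has priority.

def fetch_cycles(n_instructions: int) -> dict:
    mem_busy = set()    # cycles in which an older instruction's Mem stage holds the port
    fetched = {}
    cycle = 1
    for i in range(1, n_instructions + 1):
        # Stall the fetch while the port is taken by a Mem stage.
        while cycle in mem_busy:
            cycle += 1
        fetched[f"I{i}"] = cycle
        mem_busy.add(cycle + 3)   # this instruction's own Mem stage
        cycle += 1
    return fetched

print(fetch_cycles(4))  # {'I1': 1, 'I2': 2, 'I3': 3, 'I4': 7} -- I4 stalls in cycles 4-6
```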

3. Data Dependency (Data Hazard)

Let us consider an ADD instruction S, such that

  S: ADD R1, R2, R3

Addresses read by S = I(S) = {R2, R3}
Addresses written by S = O(S) = {R1}

Now, we say that instruction S2 depends on instruction S1 when

  [I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ ∅

This condition is called the Bernstein condition. Three cases exist:

  • Flow (data) dependence: O(S1) ∩ I(S2) ≠ ∅, S1 → S2, and S1 writes something later read by S2
  • Anti-dependence: I(S1) ∩ O(S2) ≠ ∅, S1 → S2, and S1 reads something before S2 overwrites it
  • Output dependence: O(S1) ∩ O(S2) ≠ ∅, S1 → S2, and both write the same memory location

Example: Let there be two instructions I1 and I2 such that:

  I1: ADD R1, R2, R3
  I2: SUB R4, R1, R3

When the above instructions are executed in a pipelined processor, a data dependency condition occurs: I2 tries to read the data before I1 writes it, so I2 incorrectly gets the old value from I1.

  INSTRUCTION / CYCLE   1    2    3               4
  I1                    IF   ID   EX              DM
  I2                         IF   ID(Old value)   EX

To minimize data dependency stalls in the pipeline, operand forwarding is used.

Operand Forwarding: In operand forwarding, we use the interface registers present between the stages to hold intermediate output, so that a dependent instruction can read the new value from the interface register directly. Considering the same example:

  I1: ADD R1, R2, R3
  I2: SUB R4, R1, R3

  INSTRUCTION / CYCLE   1    2    3    4
  I1                    IF   ID   EX   DM
  I2                         IF   ID   EX

Data Hazards: Data hazards occur when instructions that exhibit data dependence modify data in different stages of a pipeline. Hazards cause delays in the pipeline. There are mainly three types of data hazards:
  1. RAW (Read after Write) [Flow/True data dependency]
  2. WAR (Write after Read) [Anti-Data dependency]
  3. WAW (Write after Write) [Output data dependency]

Let there be two instructions I and J, such that J follows I. Then (see the sketch after this list):

  1. A RAW hazard occurs when instruction J tries to read data before instruction I writes it.
     Eg: I: R2 <- R1 + R3
         J: R4 <- R2 + R3
  2. A WAR hazard occurs when instruction J tries to write data before instruction I reads it.
     Eg: I: R2 <- R1 + R3
         J: R3 <- R4 + R5
  3. A WAW hazard occurs when instruction J tries to write output before instruction I writes it.
     Eg: I: R2 <- R1 + R3
         J: R2 <- R4 + R5

WAR and WAW hazards occur during the out-of-order execution of the instructions.
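Because each hazard is defined by an intersection of read and write sets, detection can be written directly from the Bernstein-style conditions above. A minimal sketch (the set representation is my own):

```python
# Sketch: classify hazards between instruction I and a later instruction J,
# given the registers each one reads and writes.

def hazards(read_i, write_i, read_j, write_j):
    found = []
    if write_i & read_j:
        found.append("RAW (flow/true dependence)")
    if read_i & write_j:
        found.append("WAR (anti-dependence)")
    if write_i & write_j:
        found.append("WAW (output dependence)")
    return found

# I: R2 <- R1 + R3 ; J: R4 <- R2 + R3  -> RAW on R2
print(hazards({"R1", "R3"}, {"R2"}, {"R2", "R3"}, {"R4"}))
# I: R2 <- R1 + R3 ; J: R3 <- R4 + R5  -> WAR on R3
print(hazards({"R1", "R3"}, {"R2"}, {"R4", "R5"}, {"R3"}))
```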

Pipelining: Types and Stalling

Types of pipeline

1. Uniform delay pipeline:
   • In this type of pipeline, all stages take the same time to complete an operation.
   • Cycle Time (Tp) = Stage Delay
   • If buffers are included between the stages, then Cycle Time (Tp) = Stage Delay + Buffer Delay

2. Non-Uniform delay pipeline:
   • In this type of pipeline, different stages take different times to complete an operation.
   • Cycle Time (Tp) = Maximum (Stage Delay)
   • For example, if there are 4 stages with delays 1 ns, 2 ns, 3 ns, and 4 ns, then Tp = Maximum (1 ns, 2 ns, 3 ns, 4 ns) = 4 ns
   • If buffers are included between the stages, Tp = Maximum (Stage Delay + Buffer Delay)

Example: Consider a 4-segment pipeline with stage delays (2 ns, 8 ns, 3 ns, 10 ns). Find the time taken to execute 100 tasks in the above pipeline.

Solution: As the above pipeline is a non-uniform delay pipeline,
  Tp = max (2, 8, 3, 10) = 10 ns
We know that
  ETpipeline = (k + n – 1) * Tp = (4 + 100 – 1) * 10 ns = 1030 ns

NOTE: MIPS = Million instructions per second.

Performance of pipeline with stalls:
  Speed Up (S) = Performance_pipeline / Performance_non-pipeline
  => S = Average Execution Time_non-pipeline / Average Execution Time_pipeline
  => S = (CPI_non-pipeline * Cycle Time_non-pipeline) / (CPI_pipeline * Cycle Time_pipeline)
The ideal CPI of a pipelined processor is 1, but due to stalls it becomes greater than 1.
  => S = (CPI_non-pipeline * Cycle Time_non-pipeline) / [(1 + Number of stalls per instruction) * Cycle Time_pipeline]
As Cycle Time_non-pipeline = Cycle Time_pipeline,
  Speed Up (S) = CPI_non-pipeline / (1 + Number of stalls per instruction)

Dynamic Instruction Scheduling: If the programmer is aware of a pipelined architecture, it may be possible to rewrite programs statically, either manually or using an optimising compiler, to separate data dependencies; otherwise they must be detected and resolved by hardware at runtime.

Scoreboarding: A scoreboard is centralized control logic which uses forwarding logic and register tagging to keep track of the status of registers and of multiple functional units. An issued instruction whose registers are not available is forwarded to a reservation station (buffer) associated with the functional unit it will use. When functional units generate new results, some data dependencies can be resolved. When all registers have valid data, the scoreboard enables the instruction execution. Similarly, when a functional unit finishes, it signals the scoreboard to release the register resources.
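Both formulas above drop straight into code. A short sketch (variable names are mine, and the stall-example values are made up for illustration):

```python
# Sketch: non-uniform pipeline execution time and speedup with stalls.

def pipeline_time(stage_delays, n_tasks, buffer_delay=0):
    tp = max(d + buffer_delay for d in stage_delays)   # cycle time
    k = len(stage_delays)
    return (k + n_tasks - 1) * tp                      # ET = (k + n - 1) * Tp

def speedup_with_stalls(cpi_non_pipeline, stalls_per_instruction):
    # Assumes equal cycle times for pipelined and non-pipelined machines.
    return cpi_non_pipeline / (1 + stalls_per_instruction)

print(pipeline_time([2, 8, 3, 10], 100))   # 1030 ns, as in the example above
print(speedup_with_stalls(4.0, 0.5))       # hypothetical: CPI 4, 0.5 stalls/instr -> ~2.67
```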

Instruction Pipeline:
  • In an instruction pipeline, a stream of instructions is executed by overlapping the fetch, decode and execute phases of the instruction cycle. This technique is used to increase the throughput of the computer system. An instruction pipeline reads instructions from memory while previous instructions are being executed in other segments of the pipeline; thus we can execute multiple instructions simultaneously. The pipeline is more efficient if the instruction cycle is divided into segments of equal duration.
  • In the most general case, the computer needs to process each instruction in the following sequence of steps:
    o Fetch the instruction from memory (FI)
    o Decode the instruction (DA)
    o Calculate the effective address
    o Fetch the operands from memory (FO)
    o Execute the instruction (EX)
    o Store the result in the proper place

The flowchart for the instruction pipeline is shown below.

Let us look at an example of an instruction pipeline. Here the instruction is fetched in the first clock cycle, in segment 1.

  • It is then decoded in the next clock cycle, after which the operands are fetched and finally the instruction is executed. We can see that the fetch and decode phases overlap due to pipelining: by the time the first instruction is being decoded, the next instruction is fetched by the pipeline.
  • In the case of the third instruction, we see that it is a branch instruction. While it is being decoded, the 4th instruction is fetched simultaneously. But as it is a branch instruction, it may point to some other instruction once it is decoded. Thus, the fourth instruction is kept on hold until the branch instruction is executed; when the branch has executed, the fourth instruction is copied back and the other phases continue as usual.

Superscalar Architecture

A more useful approach is to equip the processor with multiple processing units, so that several instructions can be handled in parallel in each processing stage. With this arrangement, several instructions start execution in the same clock cycle, and the processor is said to use multiple issue. Such processors are capable of achieving an instruction execution throughput of more than one instruction per cycle; they are known as superscalar processors.
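A common way to quantify the gain is the ideal-machine model from Hwang's text (one of the course references; the formula itself is an assumption on my part, since this section states only that throughput exceeds one instruction per cycle):

```python
# Sketch: ideal cycle count for N instructions on a k-stage pipeline that
# issues m instructions per cycle (assumes no hazards or stalls).

def issue_cycles(k: int, N: int, m: int = 1) -> float:
    # The first group of m instructions completes after k cycles; the
    # remaining N - m instructions issue in groups of m, one group per cycle.
    return k + (N - m) / m

print(issue_cycles(4, 100, 1))  # scalar base machine: 103 cycles
print(issue_cycles(4, 100, 2))  # 2-issue superscalar: 53 cycles
```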

Consider the following example of six instructions, I1 to I6, with these constraints:

  • I1 requires two cycles to execute.
  • I3 and I4 conflict for the same functional unit.
  • I5 depends on the value produced by I4.
  • I5 and I6 conflict for a functional unit.

1) In-Order Issue with In-Order Completion: Instructions are only decoded up to the point of a dependency or resource conflict; no additional instructions are decoded until the conflict is resolved. This means a maximum of two instructions can be in the execute stage, as later instructions have a time dependency on earlier ones executing first.

2) In-Order Issue with Out-of-Order Completion: Using the same set of instructions, consider the effect of allowing some instructions to complete out of order. With out-of-order completion, any number of instructions may be in the execution stage, limited only by the machine's parallelism. Instruction issuing in any one pipeline is stalled by a resource conflict, data dependency or procedural dependency.

  • Note that I2 is allowed to complete before I1. I5 depends on the value produced by I4 and cannot be issued until cycle 5.
  • Out-of-order completion requires more complex instruction-issue logic than in-order completion, and it is more difficult to deal with instruction interrupts and exceptions: when an interrupt occurs, the processor must take into account that instructions ahead of the instruction that caused the interrupt may have already completed.
  • The time from decoding the first instruction to writing the last result is 7 cycles.

3) Out-of-Order Issue with Out-of-Order Completion: To allow out-of-order issue, it is necessary to decouple the decode and execute stages of the pipeline. When an instruction has been decoded, it is placed in a buffer known as an instruction window. As long as this buffer is not full, the processor can continue to fetch and decode new instructions. When a functional unit becomes available, an instruction from the window may be issued: any instruction that needs that functional unit and has no conflicts or dependencies blocking it may be selected.

  • Note that it is possible to issue I6 before I5, as I5 has a dependency on I4. The time from decoding the first instruction to writing the last result is 6 cycles.

The instruction window can be centralised or distributed. A centralised instruction window holds all instructions irrespective of their type. In the distributed approach, instruction buffers called reservation stations are placed in front of each functional unit; decoded instructions are routed to the appropriate reservation station and subsequently issued to the functional unit when it is free and all operands for the instruction have been received by the reservation station.

Advantages of Superscalar Architecture:
  o In a superscalar processor, the detrimental effect of various hazards on performance becomes even more pronounced, but the compiler can avoid many hazards through judicious selection and ordering of instructions.
  o The compiler should strive to interleave floating-point and integer instructions. This enables the dispatch unit to keep both the integer and floating-point units busy most of the time.
  o In general, high performance is achieved if the compiler is able to arrange program instructions so as to take maximum advantage of the available hardware units.

Disadvantages of Superscalar Architecture:
  o With this type of architecture, scheduling problems can occur.

COMPARISON BETWEEN PIPELINING & SUPERSCALAR
  • Pipelining divides an instruction into steps, and since each step is executed in a different part of the processor, multiple instructions can be in different "phases" each clock. Superscalar involves the processor issuing multiple instructions in a single clock, with redundant facilities to execute an instruction within a single core.
  • In pipelining, once one instruction is done decoding it moves on towards the next execution subunit. In a superscalar design, multiple execution subunits are able to do the same thing in parallel.
  • Pipelining sequences unrelated activities so that they use different components at the same time. Superscalar has multiple sub-components capable of doing the same task simultaneously, with the processor deciding how to do it.

Superpipelining

An alternative approach to achieving better performance is superpipelining. Superpipelining exploits the fact that many pipeline stages perform tasks that require less than half a clock cycle: a doubled internal clock speed allows those stages to perform two tasks during one external clock cycle. In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle. Stages that require the full base cycle to complete can be split into a series of shorter stages, effectively increasing the length of the pipeline and matching the execution latency of each stage. Any number of instructions may be in various parts of the execution stage. As a comparison:

  • In the scalar base machine one instruction is issued per cycle, with one cycle latency for simple operations and one cycle latency between instructions. The instruction pipeline can be fully utilised if successive instructions can enter it continuously at the rate of one per cycle.
  • In a superpipelined superscalar design of degree (m, n), the machine executes m instructions every cycle with a pipeline cycle 1/n of the base cycle. Simple operation latency is n pipeline cycles. The level of parallelism required to fully utilise this machine is mn instructions. (A small timing model of this appears at the end of this comparison.)
  • The superscalar approach depends on the ability to execute multiple instructions in parallel. A combination of compiler-based optimisation and various hardware techniques can be used to maximise instruction-level parallelism.

COMPARISON BETWEEN SUPERPIPELINING & SUPERSCALAR
  • Super-pipelining attempts to increase performance by reducing the clock cycle time. It achieves that by making each pipeline stage very shallow, resulting in a large number of pipe stages. A shorter clock cycle means a faster clock. As long as your cycles per instruction (CPI) doesn’t change, a faster clock means better performance. Super-pipelining works best with code that doesn’t branch often, or has easily predicted branches.
  • Superscalar attempts to increase performance by executing multiple instructions in parallel. If we can issue more instructions every cycle, without decreasing clock rate, then CPI decreases, therefore increasing performance.
  • Superscalar breaks into two broad categories: in-order and out-of-order.
    o In-order superscalar mainly benefits code with instruction-level parallelism among a small window of consecutive instructions.
    o Out-of-order superscalar allows the pipeline to find parallelism across larger windows of code, and to hide latencies associated with long-running instructions (for example, load instructions that miss the cache).

NOTE:
  o Super-pipelining seeks to improve the sequential instruction rate, while superscalar seeks to improve the parallel instruction rate.
  o Most modern processors are both superscalar and super-pipelined: they have deep pipelines to achieve high clock rates, and wide instruction issue to make use of instruction-level parallelism.
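As promised above, the degree-(m, n) description can be turned into a small timing model. This follows Hwang's ideal-machine formulation (an assumption on my part; the notes state only the mn parallelism requirement, not this formula):

```python
# Sketch: ideal time, in base cycles, for N instructions on a superpipelined
# superscalar machine of degree (m, n): m-issue, pipeline cycle = 1/n base cycle.

def time_base_cycles(k: int, N: int, m: int = 1, n: int = 1) -> float:
    # The first m instructions complete after k base cycles; afterwards a
    # group of m instructions completes every 1/n of a base cycle.
    return k + (N - m) / (m * n)

print(time_base_cycles(4, 100, 1, 1))  # scalar base machine: 103.0
print(time_base_cycles(4, 100, 2, 2))  # degree (2, 2): 28.5
```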

Instruction-level parallelism

  • Instruction-level parallelism (ILP) is a measure of how many of the instructions in a computer program can be executed simultaneously.
  • ILP must not be confused with concurrency. ILP is about the parallel execution of a sequence of instructions belonging to one specific thread of execution of a process (that is, a running program with its set of resources: its address space, a set of registers, its identifiers, its state, program counter, and more). Concurrency, by contrast, concerns threads of one or more processes being assigned to a CPU's cores in strict alternation, or in true parallelism if there are enough cores, ideally one core for each runnable thread. There are two approaches to instruction-level parallelism: hardware and software.
  • The hardware approach works on dynamic parallelism, whereas the software approach works on static parallelism. Dynamic parallelism means the processor decides at run time which instructions to execute in parallel, whereas static parallelism means the compiler decides which instructions to execute in parallel. The Pentium processor works on the dynamic sequence of parallel execution; the Itanium processor works on static-level parallelism. Consider the following program:

      1: e = a + b
      2: f = c + d
      3: m = e * f

  • Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time, these three instructions can be completed in a total of two units of time, giving an ILP of 3/2 (operations 1 and 2 execute concurrently in one unit of time, and operation 3 requires one more unit). A sketch that computes this appears at the end of this section.
  • A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible. Ordinary programs are typically written under a sequential execution model where instructions execute one after the other and in the order specified by the programmer. ILP allows the compiler and the processor to overlap the execution of multiple instructions or even to change the order in which instructions are executed. Micro-architectural techniques that are used to exploit ILP include:
  • Instruction pipelining where the execution of multiple instructions can be partially overlapped.
  • Superscalar execution, VLIW, and the closely related explicitly parallel instruction computing concepts, in which multiple execution units are used to execute multiple instructions in parallel.
  • Out-of-order execution where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar execution. Current implementations of out-of-order execution dynamically (i.e., while the program is executing and without any help from the compiler) extract ILP from ordinary programs. An alternative is to extract this parallelism at compile time and somehow convey this information to the hardware. Due to the complexity of scaling the out-of-order execution technique, the industry has re-examined instruction sets which explicitly encode multiple independent operations per instruction.
  • Register renaming, a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations; it is used to enable out-of-order execution.
  • Speculative execution, which allows the execution of complete instructions or parts of instructions before it is certain whether this execution should take place. A commonly used form of speculative execution is control-flow speculation, where instructions past a control-flow instruction (e.g., a branch) are executed before the target of the control-flow instruction is determined.
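The ILP figure of 3/2 computed above can be obtained mechanically as (number of operations) / (length of the longest dependence chain). A minimal sketch for straight-line code, assuming one unit of time per operation (the representation is my own):

```python
# Sketch: ILP of a straight-line block = ops / critical-path depth.
# Each op is (destination, set of source names); ops are in program order.

def ilp(ops):
    ready = {}    # earliest time step at which each value is available
    depth = 0
    for dest, srcs in ops:
        t = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = t
        depth = max(depth, t)
    return len(ops) / depth

# 1: e = a + b ; 2: f = c + d ; 3: m = e * f
print(ilp([("e", {"a", "b"}), ("f", {"c", "d"}), ("m", {"e", "f"})]))  # 1.5
```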