Prepare for your exams
Get points
Guidelines and tips

Sell on Docsity

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search Store documents

The best documents sold by students who completed their studies

Search through all study resources

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

University Rankings

Discover the best universities in your country according to Docsity users

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

From our blog

Exams and Study

Go to the blog

ADVANCED COMPUTER ARCHITECTURE Notes - Multiprocessors and Thread-Level Parallelism, 1, Study notes of Advanced Computer Architecture

Punjab Technical University Advanced Computer Architecture

Notes on: Multiprocessors, Symmetric Shared Memory Architectures, Cache Coherence in Multiprocessors, Basic Schemes for Enforcing Coherence, Snooping, Snooping Protocols, broadcast protocol, Distributed Shared-Memory Architectures, Basic Hardware Primitives, Barrier Synchronization, Larger-Scale Multiprocessors, Simultaneous Multithreading, Superscalar,SMT processors

Typology: Study notes

2010/2011

On special offer

~~60 Points~~

Limited-time offer

Uploaded on 09/01/2011

punjabforever 🇮🇳

4.3

(40)

16 documents

1 / 16

This page cannot be seen from the preview

Don't miss anything!

Multiprocessors and Thread-Level Parallelism

1. Explain Symmetric Shared-Memory Architecture and How to reduce Cache

coherence problem in Symmetric Shared Memory architectures:

Symmetric Shared Memory Architectures:

The Symmetric Shared Memory Architecture consists of several processors with a single

physical memory shared by all processors through a shared bus which is shown below.

` Small-scale shared-memory machines usually support the caching of both shared

and private data. Private data is used by a single processor, while shared data is used by

multiple processors, essentially providing communication among the processors through

reads and writes of the shared data. When a private item is cached, its location is

migrated to the cache, reducing the average access time as well as the memory bandwidth

required. Since no other processor uses the data, the program behavior is identical to that

in a uniprocessor.

On special offer

Partial preview of the text

Download ADVANCED COMPUTER ARCHITECTURE Notes - Multiprocessors and Thread-Level Parallelism, 1 and more Study notes Advanced Computer Architecture in PDF only on Docsity!

Multiprocessors and Thread-Level Parallelism

Explain Symmetric Shared-Memory Architecture and How to reduce Cache coherence problem in Symmetric Shared Memory architectures: Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared by all processors through a shared bus which is shown below. ` Small-scale shared-memory machines usually support the caching of both shared and private data. Private data is used by a single processor, while shared data is used by multiple processors, essentially providing communication among the processors through reads and writes of the shared data. When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Since no other processor uses the data, the program behavior is identical to that in a uniprocessor.

Cache Coherence in Multiprocessors: Introduction of caches caused a coherence problem for I/O operations, The same problem exists in the case of multiprocessors, because the view of memory held by two different processors is through their individual caches. Figure 6.7 illustrates the problem and shows how two different processors can have two different values for the same location. This difficulty s generally referred to as the cache-coherence problem. Time Event Cache contents for CPU A Cache contents for CPU B Memory contents for location X 0 1 1 CPU A reads X 1 1 2 CPU B reads X 1 1 1 3 CPU A stores 0 into X 0 1 0 FIGURE 6.7 The cache-coherence problem for a single memory location (X), read and written by two processors (A and B). We initially assume that neither cache contains the variable and that X has the value 1. We also assume a write-through cache; a write-back cache adds some additional but similar complications. After the value of X has been written by A, A’s cache and the memory both contain the new value, but B’s cache does not, and if B reads the value of X, it will receive 1! Informally, we could say that a memory system is coherent if any read of a data item returns the most recently written value of that data item. This simple definition contains two different aspects of memory system behavior, both of which are critical to writing correct shared-memory programs. The first aspect, called coherence, defines what values can be returned by a read. The second aspect, called consistency, determines when a written value will be returned by a read. Let’s look at coherence first. A memory system is coherent if 1 A read by a processor, P, to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.

Snooping Protocols The method which ensure that a processor has exclusive access to a data item before it writes that item. This style of protocol is called a write invalidate protocol because it invalidates other copies on a write. It is by far the most common protocol, both for snooping and for directory schemes. Exclusive access ensures that no other readable or writable copies of an item exist when the write occurs: all other cached copies of the item are invalidated. Figure 6.8 shows an example of an invalidation protocol for a snooping bus with write- back caches in action To see how this protocol ensures coherence, consider a write followed by a read by another processor: Since the write requires exclusive access, any copy held by the reading processor must be invalidated (hence the protocol name). Thus, when the read occurs, it misses in the cache and is forced to fetch a new copy of the data. For a write, we require that the writing processor have exclusive access, preventing any other processor from being able to write simultaneously. If two processors do attempt to write the same data simultaneously, one of them wins the race, causing the other processor’s copy to be invalidated. For the other processor to complete its write, it must obtain a new copy of the data, which must now contain the updated value. Therefore, this protocol enforces write serialization. Contents of Contents of Contents of memory Processor activity Bus activity CPU A’s cache CPU B’s cache location X 0 CPU A reads X Cache miss for X 0 0 CPU B reads X Cache miss for X 0 0 0 CPU A writes a 1 to X Invalidation for X 1 0 CPU B reads X Cache miss for X 1 1 1 FIGURE 6.8 An example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches.

The alternative to an invalidate protocol is to update all the cached copies of a data item when that item is written. This type of protocol is called a write update or write broadcast protocol. Figure 6.8 shows an example of a write update protocol in operation. In the decade since these protocols were developed, invalidate has emerged as the winner for the vast majority of designs. Contents of Contents of Contents of memory Processor activity Bus activity CPU A’s cache CPU B’s cache location X 0 CPU A reads X Cache miss for X 0 0 CPU B reads X Cache miss for X 0 0 0 CPU A writes a 1 to X Write broadcast of X

CPU B reads X 1 1 1 FIGURE 6.9 An example of a write update or broadcast protocol working on a snooping bus for a single cache block (X) with write-back caches. The performance differences between write update and write invalidate protocols arise from three characteristics:  Multiple writes to the same word with no intervening reads require multiple write broadcasts in an update protocol, but only one initial invalidation in a write invalidate protocol.  With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol. An invalidation protocol works on cache blocks, while an update protocol must work on individual words (or bytes, when bytes are written). It is possible to try to merge writes in a write broadcast scheme.  The delay between writing a word in one processor and reading the written value in another processor is usually less in a write update scheme, since the written data are immediately updated in the reader’s cache

Explain Distributed Shared-Memory Architectures. There are several disadvantages in Symmetric Shared Memory architectures. First, compiler mechanisms for transparent software cache coherence are very limited. Second, without cache coherence, the multiprocessor loses the advantage of being able to fetch and use multiple words in a single cache block for close to the cost of fetching one word. Third, mechanisms for tolerating latency such as prefetch are more useful when they can fetch multiple words, such as a cache block, and where the fetched data remain coherent; we will examine this advantage in more detail later. These disadvantages are magnified by the large latency of access to remote memory versus a local cache. For these reasons, cache coherence is an accepted requirement in small-scale multiprocessors. For larger-scale architectures, there are new challenges to extending the cache-coherent shared-memory model. Although the bus can certainly be replaced with a more scalable interconnection network and we could certainly distribute the memory so that the memory bandwidth could also be scaled, the lack of scalability of the snooping coherence scheme needs to be addressed is known as Distributed Shared Memory architecture. The first coherence protocol is known as a directory protocol. A directory keeps the state of every block that may be cached. Information in the directory includes which caches have copies of the block, whether it is dirty, and so on. To prevent the directory from becoming the bottleneck, directory entries can be distributed along with the memory, so that different directory accesses can go to different locations, just as different memory requests go to different memories. A distributed directory retains the characteristic that the sharing status of a block is always in a single known location. This property is what allows the coherence protocol to avoid broadcast. Figure 6.27 shows how our distributed-memory multiprocessor looks with the directories added to each node.

FIGURE 6.27 A directory is added to each node to implement cache coherence in a distributed-memory multiprocessor. Directory-Based Cache-Coherence Protocols: The Basics There are two primary operations that a directory protocol must implement: handling a read miss and handling a write to a shared, clean cache block. (Handling a write miss to a shared block is a simple combination of these two.) To implement these operations, a directory must track the state of each cache block. In a simple protocol, these states could be the following: Shared —One or more processors have the block cached, and the value in memory is up to date (as well as in all the caches) Uncached —No processor has a copy of the cache block Exclusive —Exactly one processor has a copy of the cache block and it has written the block, so the memory copy is out of date. The processor is called the owner of the block. In addition to tracking the state of each cache block, we must track the processors that have copies of the block when it is shared, since they will need to be invalidated on a write. The simplest way to do this is to keep a bit vector for each memory block. When the block is shared, each bit of the vector indicates whether the corresponding processor has a copy of that block. We can also use the bit vector to keep track of the owner of the block when the block is in the exclusive state. For efficiency reasons, we also track the

What is Synchronization? Explain various Hardware Primitives Synchronization Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. The efficient spin locks can be built using a simple hardware synchronization instruction and the coherence mechanism. Basic Hardware Primitives The key ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location. Without such a capability, the cost of building basic synchronization primitives will be too high and will increase as the processor count increases. There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically. These hardware primitives are the basic building blocks that are used to build a wide variety of user-level synchronization operations, including things such as locks and barriers. One typical operation for building synchronization operations is the atomic exchan ge, which interchanges a value in a register for a value in memory. Use this to build a basic synchronization operation, assume that we want to build a simple lock where the value 0 is used to indicate that the lock is free and a 1 is used to indicate that the lock is unavailable. A processor tries to set the lock by doing an exchange of 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 if some other processor had already claimed access and 0 otherwise. In the latter case, the value is also changed to be 1, preventing any competing exchange from also retrieving a 0. There are a number of other atomic primitives that can be used to implement synchronization. They all have the key property that they read and update a memory value in such a manner that we can tell whether or not the two operations executed atomically. One operation, present in many older multiprocessors, is test-and-set, which tests a value and sets it if the value passes the test. For example, we could define an

operation that tested for 0 and set the value to 1, which can be used in a fashion similar to how we used atomic exchange. Another atomic synchronization primitive is fetch-and-increment: it returns the value of a memory location and atomically increments it. By using the value 0 to indicate that the synchronization variable is unclaimed, we can use fetch-and-increment, just as we used exchange. There are other uses of operations like fetch-and-increment. Implementing Locks Using Coherence We can use the coherence mechanisms of a multiprocessor to implement spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds. Spin locks are used when a programmer expects the lock to be held for a very short amount of time and when she wants the process of locking to be low latency when the lock is available. Because spin locks tie up the processor, waiting in a loop for the lock to become free, they are inappropriate in some circumstances. The simplest implementation, which we would use if there were no cache coherence, would keep the lock variables in memory. A processor could continually try to acquire the lock using an atomic operation, say exchange, and test whether the exchange returned the lock as free. To release the lock, the processor simply stores the value 0 to the lock. Here is the code sequence to lock a spin lock whose address is in R1 using an atomic exchange: DADDUI R2,R0,# lockit: EXCH R2,0(R1) ; atomic exchange BNEZ R2,lockit ; already locked? If our multiprocessor supports cache coherence, we can cache the locks using the coherence mechanism to maintain the lock value coherently. Caching locks has two advantages. First, it allows an implementation where the process of “spinning” (trying to test and acquire the lock in a tight loop) could be done on a local cached copy rather than requiring a global memory access on each attempt to acquire the lock. The second advantage comes from the observation that there is often locality in lock accesses: that is, the processor that used the lock last will use it again in the near future. In such cases, the

Another technique for implementing locks is to use queuing locks. Queuing locks work by constructing a queue of waiting processors; whenever a processor frees up the lock, it causes the next processor in the queue to attempt access. This eliminates contention for a lock when it is freed. We show how queuing locks operate in the next section using a hardware implementation, but software implementations using arrays can achieve most of the same benefits Before we look at hardware primitives, Hardware Primitives In this section we look at two hardware synchronization primitives. The first primitive deals with locks, while the second is useful for barriers and a number of other user-level operations that require counting or supplying distinct indices. In both cases we can create a hardware primitive where latency is essentially identical to our earlier version, but with much less serialization, leading to better scaling when there is contention. The major problem with our original lock implementation is that it introduces a large amount of unneeded contention. For example, when the lock is released all processors generate both a read and a write miss, although at most one processor can successfully get the lock in the unlocked state. This sequence happens on each of the 20 lock/unlock sequences. We can improve this situation by explicitly handing the lock from one waiting processor to the next. Rather than simply allowing all processors to compete every time the lock is released, we keep a list of the waiting processors and hand the lock to one explicitly, when its turn comes. This sort of mechanism has been called a queuing lock. Queuing locks can be implemented either in hardware, or in software using an array to keep track of the waiting processes.

What is Multithreading? How to exploiting TLP. Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. To permit this sharing, the processor must duplicate the independent state of each thread. For example, a separate copy of the register file, a

separate PC, and a separate page table are required for each thread. There are two main approaches to multithreading. Fine-grained multithreading switches between threads on each instruction, causing the execution of multiples threads to be interleaved. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that time. Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. Coarse-grained multithreading switches threads only on costly stalls, such as level two cache misses. This change relieves the need to have thread-switching be essentially free and is much less likely to slow the processor down, since instructions from other threads will only be issued, when a thread encounters a costly stall. Simultaneous Multithreading: Converting Thread-Level Parallelism into Instruction- Level Parallelism: Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple issue, dynamically-scheduled processor to exploit TLP at the same time it exploits ILP. The key insight that motivates SMT is that modern multiple- issue processors often have more functional unit parallelism available than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability. Figure 6.44 conceptually illustrates the differences in a processor’s ability to exploit the resources of a superscalar for the following processor configurations: n a superscalar with no multithreading support, n a superscalar with coarse-grained multithreading, n a superscalar with fine-grained multithreading, and n a superscalar with simultaneous multithreading.

Design Challenges in SMT processors There are a variety of design challenges for an SMT processor, including:  Dealing with a larger register file needed to hold multiple contexts,  Maintaining low overhead on the clock cycle, particularly in critical steps such as instruction issue, where more candidate instructions need to be considered, and in instruction completion, where choosing what instructions to commit may be challenging, and  Ensuring that the cache conflicts generated by the simultaneous execution of multiple threads do not cause significant performance degradation. In viewing these problems, two observations are important. In many cases, the potential performance overhead due to multithreading is small, and simple choices work well enough. Second, the efficiency of current super-scalars is low enough that there is room for significant improvement, even at the cost of some overhead.

ADVANCED COMPUTER ARCHITECTURE Notes - Multiprocessors and Thread-Level Parallelism, 1, Study notes of Advanced Computer Architecture

Related documents

Partial preview of the text

Download ADVANCED COMPUTER ARCHITECTURE Notes - Multiprocessors and Thread-Level Parallelism, 1 and more Study notes Advanced Computer Architecture in PDF only on Docsity!