






- Data parallel: performing a repeated operation (or chain of operations) over vectors of data.
- Conventionally expressed as a loop, but implementations can be constructed to perform the loop operations as a single operation.
- Operations can be conditional on elements (see the a2 assignment).
- Non-unit strides are often used (second example).
Examples of data parallelism:
∀i ∈ 0..n:
    a1(i) = b1(i) + c1(i)
    if (b2(i) ≠ 0) → a2(i) = b2(i) + 4
    a3(i) = b3(i) + c3(i+1)

∀i ∈ 0, 2, 4, ..., n:
    a4(i) = b4(i) + c4(i)
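The loops above can be sketched in plain Python (all names hypothetical); a vector or SIMD machine would execute each whole loop as one operation, or a few, rather than element by element:

```python
# Minimal sketch of the four data-parallel loops. c3 must have n+1
# elements because of the shifted operand c3(i+1).
def data_parallel_examples(n, b1, c1, b2, b3, c3, b4, c4):
    a1 = [b1[i] + c1[i] for i in range(n)]             # a1(i) = b1(i) + c1(i)
    # conditional on elements: only compute where b2(i) != 0
    a2 = [b2[i] + 4 if b2[i] != 0 else 0 for i in range(n)]
    a3 = [b3[i] + c3[i + 1] for i in range(n)]         # shifted operand c3(i+1)
    a4 = [b4[i] + c4[i] for i in range(0, n, 2)]       # non-unit stride of 2
    return a1, a2, a3, a4
```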
Three basic solutions:
I (^) Vector processors
I (^) SIMD processors
I (^) GPU processors
Contrary to the textbook prose, vector processors do not predate SIMD processors by 30 years; the ILLIAC IV project began in 1966, predating vector processors! Of course, the embedding of SIMD in x86 occurred much later.
Interestingly, the Burroughs BSP was effectively the successor to the ILLIAC IV. It had 16 arithmetic units and 17 memory units; the prime number of memory units facilitates conflict-free parallel access across a broader range of stride lengths.
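A small sketch of why a prime number of memory banks helps strided access (an illustrative model, not the actual BSP hardware): with `banks` memory units and accesses at addresses 0, s, 2s, ..., the number of distinct banks touched is banks / gcd(banks, s), so a prime bank count only suffers conflicts when the stride is a multiple of it.

```python
def banks_touched(banks, stride, accesses):
    """Count distinct memory banks hit by a strided access pattern,
    assuming simple address-mod-banks interleaving."""
    return len({(i * stride) % banks for i in range(accesses)})
```

For example, a stride of 4 uses only 4 of 16 banks (heavy conflicts), but all 17 of 17 banks.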
[Figure: basic vector architecture — main memory feeding vector registers and scalar registers, with pipelined functional units: FP add/subtract, FP multiply, FP divide, integer, logical, and vector load/store]
- Vector registers: high port count to allow multiple concurrent reads/writes each cycle.
- Vector functional units: usually heavily pipelined to permit a new operation to begin each cycle.
- Vector load/store unit: to feed the beast, these are also pipelined to move data into/out of the core.
- Scalar registers: integer and floating point.
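Because vector registers hold a fixed number of elements, loops longer than the maximum vector length (MVL) are processed in register-sized chunks, a standard technique called strip mining. A minimal sketch, assuming a hypothetical MVL of 64:

```python
MVL = 64  # hypothetical maximum vector length (elements per vector register)

def strip_mined_add(a, b):
    """Add two vectors in chunks of at most MVL elements, the way a
    vector unit handles a loop longer than its register length."""
    n = len(a)
    out = [0] * n
    for start in range(0, n, MVL):
        end = min(start + MVL, n)
        # a single vector instruction would process this whole chunk
        for i in range(start, end):
            out[i] = a[i] + b[i]
    return out
```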
[Figure: (a) element groups; (b) SIMD organization — a single instruction stream (IS) drives n processing units (PU), each operating on its own data stream (DS) against per-unit memory modules (MM) or shared memory (SM)]
- GPUs provide multiple types of parallelism, originally developed for processing the vectors and vector operations commonly found in graphics workloads.
- Processing with GPUs is often considered a heterogeneous processing platform.
- CUDA/OpenCL: two models for programming heterogeneous systems. CUDA is specific to NVIDIA GPGPUs and highly successful; OpenCL is more general and intended to support heterogeneous parallelism, but its community is fractured and only Intel really supports it.
- ROCm: AMD's continuation of OpenCL (supporting version 2.0, working on version 3.0); intended to be general purpose and run on a multitude of GPGPU hardware.
- Metal: Apple's proprietary GPU programming framework (2014), set up to support GPGPU programming (especially machine learning, image processing, and neural networks).
- The real challenge is planning the migration of data into/out of the GPGPU. The GPGPU often has limited memory space, and feeding the beast becomes an issue.
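The data-migration cost can be captured in a back-of-envelope break-even model: offloading pays only when the compute-time saving exceeds the time spent moving data over the host-device link. All numbers below are illustrative assumptions, not measurements of any real system.

```python
def offload_wins(bytes_moved, cpu_seconds, gpu_seconds,
                 link_bytes_per_sec=16e9):  # assumed PCIe-class bandwidth
    """Return True if GPU compute time plus host-device transfer time
    beats running the kernel on the CPU alone (simplified model)."""
    transfer = bytes_moved / link_bytes_per_sec
    return gpu_seconds + transfer < cpu_seconds
```

With these assumed numbers, moving 1 GB (~0.06 s of transfer) is worth a 10x kernel speedup, but moving 64 GB (~4 s of transfer) swamps the same saving.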