GPU

Terms

  1. thread

    1. A thread is an abstract entity that represents the execution of a kernel
  2. thread block

    1. a programming abstraction that represents a group of threads that can be executed serially or in parallel
    2. Threads are grouped into blocks for better process and data mapping
    3. A block contains at most 512 or 1024 threads, depending on the GPU architecture
    4. The threads in the same thread block
      1. run on the same stream processor
      2. can communicate with each other via
        1. shared memory,
        2. barrier synchronization
        3. or other synchronization primitives such as atomic operations.
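    The minimal sketch below (not part of the original notes; the kernel name and the 256-thread block size are illustrative assumptions) shows these mechanisms in use: the threads of one block cooperate through shared memory and a __syncthreads() barrier to compute a per-block partial sum.

    __global__ void blockSumKernel(const float *in, float *blockSums, int n)
    {
        __shared__ float tile[256];      // visible only to the threads of this block

        int tid = threadIdx.x;
        int index = blockIdx.x * blockDim.x + tid;

        // Each thread loads one element (or 0 past the end of the array).
        tile[tid] = (index < n) ? in[index] : 0.0f;
        __syncthreads();                 // barrier: wait until every thread has stored its element

        // Tree reduction inside the block; each step halves the active threads.
        // Assumes the block is launched with 256 threads (a power of two).
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                tile[tid] += tile[tid + stride];
            __syncthreads();             // barrier before the next step reads the updated values
        }

        if (tid == 0)
            blockSums[blockIdx.x] = tile[0];   // thread 0 publishes the block's partial sum
    }
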
  3. grid

    1. Multiple blocks are combined to form a grid
    2. All the blocks in the same grid contain the same number of threads
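    The sketch below (not from the original notes; the 16x16 block shape, the kernel name, and the device pointer d_img are illustrative assumptions) shows a 2D grid of 2D blocks covering a width x height image; every block in the grid holds the same number of threads (16 * 16 = 256).

    __global__ void scaleImageKernel(float *img, int width, int height, float s)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            img[row * width + col] *= s;   // each thread scales one pixel
    }

    // Host-side launch (assumes d_img already holds a width x height image on the device).
    void launchScale(float *d_img, int width, int height)
    {
        dim3 block(16, 16);                       // 16x16 = 256 threads per block
        dim3 grid((width + block.x - 1) / block.x,    // round up so the grid
                  (height + block.y - 1) / block.y);  // covers the whole image
        scaleImageKernel<<<grid, block>>>(d_img, width, height, 2.0f);
    }
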
  4. stream processor, SM or Streaming Multiprocessors

    1. general purpose processors with a low clock rate target and a small cache

    2. An SM is able to execute several thread blocks in parallel

    3. As soon as one of its thread blocks has completed execution, it takes up the serially next thread block

    4. SMs support instruction-level parallelism but not branch prediction

    5. an SM contains the following:[8]

      1. Execution cores. (single precision floating-point units, double precision floating-point units, special function units (SFUs)).
      2. Caches:
        1. L1 cache. (for reducing memory access latency).
        2. Shared memory. (for shared data between threads).
        3. Constant cache (for broadcasting of reads from a read-only memory).
        4. Texture cache. (for aggregating bandwidth from texture memory).
      3. Schedulers for warps (for issuing instructions to warps according to a particular scheduling policy).
      4. A substantial number of registers (an SM may run a large number of active threads at a time, so it needs registers in the thousands).
    6. The hardware schedules thread blocks to an SM.

      1. In general an SM can handle multiple thread blocks at the same time.
      2. An SM may contain up to 8 thread blocks in total.
      3. A thread ID is assigned to a thread by its respective SM.
      4. Whenever an SM executes a thread block, all the threads inside the thread block are executed at the same time.
        1. Hence, to free the memory of a thread block inside the SM, the entire set of threads in the block must have concluded execution.
        2. Each thread block is divided into scheduling units known as warps.
      5. The warp scheduler of the SM decides which warp gets prioritized during instruction issue.[11]
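    The per-SM resources listed above (resident threads, registers, shared memory, warp size) can be queried from the CUDA runtime. The sketch below is a minimal host program, not part of the original notes, that reads them for device 0 with cudaGetDeviceProperties.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of GPU 0

        printf("SM count              : %d\n", prop.multiProcessorCount);
        printf("Max threads per SM    : %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
        printf("Warp size             : %d\n", prop.warpSize);
        printf("Registers per SM      : %d\n", prop.regsPerMultiprocessor);
        printf("Shared memory per SM  : %zu bytes\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }
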


  5. CUDA

    1. Compute Unified Device Architecture, NVIDIA's parallel computing platform and programming model for writing and launching kernels on the GPU
  6. kernel

    1. A [kernel](https://en.wikipedia.org/wiki/Compute_kernel) is a small program or a function
    2. A kernel is executed with the aid of threads
  7. warps

    1. On the hardware side, a thread block is composed of ‘warps’.
    2. A warp is a set of 32 threads within a thread block such that all the threads in a warp execute the same instruction. These threads are selected serially by the SM.
    3. Once a thread block is launched on a multiprocessor (SM), all of its warps are resident until their execution finishes.
      1. Thus a new block is not launched on an SM until there is sufficient number of free registers for all warps of the new block, and until there is enough free shared memory for the new block.
    4. Consider a warp of 32 threads executing an instruction. If one or both of its operands are not ready (e.g. have not yet been fetched from global memory), a process called ‘context switching’ takes place which transfers control to another warp.[12]
    5. When switching away from a particular warp, all the data of that warp remains in the register file so that it can be quickly resumed when its operands become ready. When an instruction has no outstanding data dependencies, that is, both of its operands are ready, the respective warp is considered ready for execution. If more than one warp is eligible for execution, the parent SM uses a warp scheduling policy to decide which warp gets the next fetched instruction.
    6. Different policies for scheduling warps that are eligible for execution are discussed below:[13]
      1. Round Robin (RR) - Instructions are fetched in a round-robin manner. RR ensures the SMs are kept busy and no clock cycles are wasted on memory latencies.
      2. Least Recently Fetched (LRF) - The warp for which an instruction has not been fetched for the longest time gets priority in the fetching of an instruction.
      3. Fair (FAIR)[13] - The scheduler makes sure all warps are given a ‘fair’ opportunity in the number of instructions fetched for them: it fetches instructions for the warp that has had the fewest instructions fetched so far.
      4. Thread block-based CAWS[14] (criticality-aware warp scheduling) - The emphasis of this scheduling policy is on improving the execution time of the thread blocks. It allocates more time resources to the warp that will take the longest time to execute. By giving priority to the most critical warp, this policy allows thread blocks to finish faster, so that the resources become available sooner.
    7. Traditional CPU thread context “switching” requires saving and restoring allocated register values and the program counter to off-chip memory (or cache) and is therefore a much more heavyweight operation than warp context switching.
      1. All of a warp’s register values (including its program counter) remain in the register file, and the shared memory (and cache) remain in place too since these are shared between all the warps in the thread block.
    8. In order to take advantage of the warp architecture, programming languages and developers need to understand how to coalesce memory accesses and how to manage control-flow divergence, as illustrated in the sketch below.
      1. If each thread in a warp takes a different execution path, or if each thread accesses widely divergent memory locations, the benefits of the warp architecture are lost and performance degrades significantly.
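    The sketch below (not from the original notes; kernel names are illustrative) contrasts a warp-friendly kernel, in which consecutive threads access consecutive elements and follow the same path, with one that provokes branch divergence and uncoalesced, strided accesses.

    // Coalesced and divergence-free: lane k of a warp reads element i+k,
    // and every thread of the warp executes the same instructions.
    __global__ void goodKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];
    }

    // Problematic: even and odd lanes of a warp take different branches
    // (each half waits while the other executes), and the strided reads
    // cannot be coalesced into a single memory transaction.
    __global__ void badKernel(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] = 2.0f * in[(i * stride) % n];
        else
            out[i] = -in[(i * stride) % n];
    }
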

    Figure: GPU warp scheduler (Warp-Scheduler-Gpu.jpg)

Thread

Thread Hierarchy in CUDA Programming[4]


__global__
void vecAddKernel(float *A, float *B, float *C, int n)
{
    // Each thread handles one element: the block covers blockDim.x
    // consecutive indices, and threadIdx.x selects one of them.
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard the threads of the last block that fall past the end of the vectors.
    if (index < n)
    {
        C[index] = A[index] + B[index];
    }
}
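A minimal host-side sketch (not in the original notes; the vector length, the 256-thread block size, and the variable names are assumptions) of how vecAddKernel is typically launched: the block size fixes blockDim.x, and the grid size is rounded up so that all n elements are covered.

#include <cuda_runtime.h>

int main()
{
    const int n = 1 << 20;                    // 1M elements (assumed)
    const size_t bytes = n * sizeof(float);

    // Device buffers; host data transfer and error checking omitted for brevity.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    int blockSize = 256;                              // threads per block
    int gridSize  = (n + blockSize - 1) / blockSize;  // blocks in the grid, rounded up
    vecAddKernel<<<gridSize, blockSize>>>(d_A, d_B, d_C, n);
    cudaDeviceSynchronize();                          // wait for the kernel to finish

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}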

Hardware perspective


ref

  1. https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)
  2. http://people.maths.ox.ac.uk/~gilesm/old/pp10/lec1.pdf
