A Template for GPU programming using CUDA tools.

This is a template for programming an NVIDIA GPU in the AXEL cluster (the target is the Tesla C1060). The example computes the sum of two vectors, each with N components. The files referred to in this document are located in the gpu directory after extracting the axel_templates package.

One thread is created to sum each component of the vectors. Each block holds 256 threads (tpb, threads per block), and 4 blocks are launched. The kernel executed by every thread is marked with the __global__ keyword, as shown below.

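__global__ void kernel(float *A, float *B, float *C) {
  // global index of this thread: block offset plus position within the block
  int i = blockIdx.x * blockDim.x + threadIdx.x; // blockDim.x equals tpb (256 here)
  if (i < N) // guard: threads with i >= N have no element to process
    C[i] = A[i] + B[i];
}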

The index of the current thread is computed from the block index, the block size, and the thread index within the block. Since 4 x 256 = 1024 threads are created, some threads may have an index outside the defined memory range of the vectors when N is smaller than 1024. The if statement prevents those threads from accessing that memory.
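
As a rough illustration of why the bounds check matters, the sketch below counts how many threads end up with an index outside the vector. It assumes N = 1000 (this document does not give a value for N) and uses the 4 blocks of 256 threads described above. The kernel name count_idle and the helper count_idle_threads are hypothetical and not part of the template, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1000    // assumed vector length; not specified in the template
#define TPB 256   // threads per block, as in the text
#define BLOCKS 4  // number of blocks, as in the text

// Each thread computes its global index the same way the vector-add kernel
// does, and increments a counter if that index falls outside the vector.
__global__ void count_idle(int *idle) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N)
    atomicAdd(idle, 1);
}

// Hypothetical helper: launches the kernel and returns the number of
// threads that had no vector element to process.
int count_idle_threads() {
  int h_idle = 0;
  int *d_idle = NULL;
  cudaMalloc(&d_idle, sizeof(int));
  cudaMemcpy(d_idle, &h_idle, sizeof(int), cudaMemcpyHostToDevice);

  count_idle<<<BLOCKS, TPB>>>(d_idle);

  // This copy waits for the kernel to finish before reading the counter.
  cudaMemcpy(&h_idle, d_idle, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_idle);
  return h_idle;
}

int main() {
  printf("threads launched: %d\n", BLOCKS * TPB);
  printf("threads outside the vector: %d\n", count_idle_threads());
  return 0;
}
```

With these assumed values, 4 x 256 - 1000 = 24 threads have no element to work on. Without the check, each of them would read and write past the end of the vectors.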

The kernel is executed on the GPU, while the main function runs on the host CPU. The following CUDA kernel launch syntax is used in the host program to create and start the threads.

kernel<<<(N+tpb-1)/tpb, tpb>>>(A, B, C);

The GPU kernel and the CPU main function can coexist in the same source file, which has a .cu extension. The CUDA compiler, nvcc, is used instead of gcc to generate a single binary executable.
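
To show how the pieces might fit together in one .cu file, here is a minimal sketch of a complete vector-add program. It is not the template's actual source. N = 1000, the input values, the file name vecadd.cu, and the helpers check and run_vector_add are all assumptions made for illustration. The kernel and the launch line follow the ones shown earlier, with blockDim.x used for the block size.

```cuda
// vecadd.cu (hypothetical file name)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1000        // assumed vector length; not specified in the template
const int tpb = 256;  // threads per block

// Device code: one thread adds one pair of components.
__global__ void kernel(const float *A, const float *B, float *C) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N)  // skip threads beyond the end of the vectors
    C[i] = A[i] + B[i];
}

// Hypothetical helper: stop with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}

// Hypothetical helper: runs the addition on the GPU and returns the
// number of components that differ from the CPU result.
int run_vector_add() {
  const size_t bytes = N * sizeof(float);
  float *hA = (float *)malloc(bytes);
  float *hB = (float *)malloc(bytes);
  float *hC = (float *)malloc(bytes);
  for (int i = 0; i < N; ++i) {  // small integers, exactly representable
    hA[i] = (float)i;
    hB[i] = (float)(2 * i);
  }

  // Allocate device memory and copy the inputs to the GPU.
  float *dA, *dB, *dC;
  check(cudaMalloc(&dA, bytes), "cudaMalloc A");
  check(cudaMalloc(&dB, bytes), "cudaMalloc B");
  check(cudaMalloc(&dC, bytes), "cudaMalloc C");
  check(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice), "copy A");
  check(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice), "copy B");

  // Round the block count up so that every component gets a thread.
  kernel<<<(N + tpb - 1) / tpb, tpb>>>(dA, dB, dC);
  check(cudaGetLastError(), "kernel launch");

  // This blocking copy also waits for the kernel to finish.
  check(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost), "copy C");

  int errors = 0;
  for (int i = 0; i < N; ++i)
    if (hC[i] != hA[i] + hB[i]) ++errors;

  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  free(hA); free(hB); free(hC);
  return errors;
}

int main() {
  int errors = run_vector_add();
  printf("%d mismatches out of %d components\n", errors, N);
  return errors == 0 ? 0 : 1;
}
```

Assuming the file is saved as vecadd.cu, it could be built with a command along the lines of nvcc -o vecadd vecadd.cu. The exact flags used by the template's own build files may differ.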