Tuesday, May 7, 2013

Custom kernels in GPULib 1.6

One of the new features of GPULib 1.6 is the ability to load and execute custom CUDA kernels at runtime. This means you can run CUDA code that knows nothing about GPULib (or even IDL). Let's run through what is required to do this for a simple example.

Our example will simply add two n-element arrays and store the result in a third array. The kernel will be written in a file `vectorAdd_kernel.cu` and should be straightforward for anyone who has written CUDA:

  extern "C" __global__ void VecAdd_kernel(const float* A,
                                           const float* B,
                                           float* C,
                                           int N)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < N)
          C[i] = A[i] + B[i];
  }
This code is then compiled to PTX code using the --ptx flag of nvcc:
  $ nvcc --ptx vectorAdd_kernel.cu
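By default, nvcc generates PTX for its default (and typically rather old) virtual architecture. If your kernel relies on newer hardware features, you can target a specific compute capability with the -arch flag; the sm_20 value below is only an example, so pick the architecture that matches your GPU:

  $ nvcc --ptx -arch=sm_20 vectorAdd_kernel.cu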
This produces the file `vectorAdd_kernel.ptx` which can be loaded and executed with a couple of new GPULib routines. First, load the source code of the kernel:
  ptx_source = gpu_read_ptxfile('vectorAdd_kernel.ptx')
This simply loads the contents of the PTX file into an IDL string. A PTX file could contain multiple kernel definitions; the module represents the entirety of the code, while the function represents one particular kernel. We call GPULib routines to load the module and the function:
  module = gpuLoadModule(ptx_source, error=err)
  kernel = gpuLoadFunction(module, 'VecAdd_kernel', error=err)
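Since the module represents the whole PTX file, you can pull several kernels out of a single module by calling gpuLoadFunction once per kernel name. For instance, if `vectorAdd_kernel.cu` also defined a second kernel (the VecScale_kernel name below is purely hypothetical), the loading code might look like:

  module = gpuLoadModule(ptx_source, error=err)
  add_kernel = gpuLoadFunction(module, 'VecAdd_kernel', error=err)
  scale_kernel = gpuLoadFunction(module, 'VecScale_kernel', error=err)   ; hypothetical second kernel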
Next, we create two arrays to add and a third array to contain the result:
  n = 20L
  dx = gpuFindgen(n)
  dy = gpuFindgen(n)
  dz = gpuFltarr(n)
Lastly, we are ready to call our kernel. The grid and block structure is set through the GRID and BLOCKS keywords; otherwise, we just pass the kernel and its arguments to GPUEXECUTEFUNCTION:
  nThreadsPerBlock = 256L
  nBlocksPerGrid = (n + nThreadsPerBlock - 1L) / nThreadsPerBlock

  gpuExecuteFunction, kernel, $
                      dx->_getHandle(), $
                      dy->_getHandle(), $
                      dz->_getHandle(), $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      ERROR=err
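To sanity-check the launch, the result can be copied back to the host and compared against the same sum computed in IDL. This is just a minimal sketch, assuming GPULib's gpuGetArr routine handles the device-to-host transfer for these arrays (the exact transfer call may differ between GPULib interfaces and versions):

  gpuGetArr, dz, result            ; copy the GPU result into an IDL array (assumed routine)
  print, result
  print, findgen(n) + findgen(n)   ; expected values, computed on the host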
There is also a SHARED_MEMORY keyword to pass the amount of shared memory that your kernel requires, but it is not needed for our simple example.
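If a kernel did declare dynamically sized shared memory, e.g. `extern __shared__ float cache[];` in the .cu file, the launch would look the same apart from the extra keyword. Here is a minimal sketch, assuming SHARED_MEMORY takes the size in bytes, as the underlying CUDA launch API does:

  nBytesShared = nThreadsPerBlock * 4L   ; one float per thread, 4 bytes each

  gpuExecuteFunction, kernel, $
                      dx->_getHandle(), $
                      dy->_getHandle(), $
                      dz->_getHandle(), $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock], $
                      SHARED_MEMORY=nBytesShared, $
                      ERROR=err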
