Tuesday, May 7, 2013

Custom kernels in GPULib 1.6

One of the new features of GPULib 1.6 is the ability to load and execute custom CUDA kernels at runtime. This means you can run CUDA code that knows nothing about GPULib (or even IDL). Let's run through what is required, using a simple example.

Our example will simply add two n-element arrays and store the result in a third array. The kernel is written in a file `vectorAdd_kernel.cu` and should be straightforward for anyone who writes CUDA:

  extern "C" __global__ void VecAdd_kernel(const float* A,
                                           const float* B,
                                           float* C,
                                           int N)
  {
      int i = blockDim.x * blockIdx.x + threadIdx.x;
      if (i < N)
          C[i] = A[i] + B[i];
  }
This code is then compiled to PTX code using the --ptx flag of nvcc:
  $ nvcc --ptx vectorAdd_kernel.cu
This produces the file `vectorAdd_kernel.ptx`, which can be loaded and executed with a couple of new GPULib routines. First, load the source code of the kernel:
  ptx_source = gpu_read_ptxfile('vectorAdd_kernel.ptx')
This simply loads the source code from the PTX file into an IDL string. The source code could contain multiple kernel definitions; the module represents the entirety of the code, while the function represents one particular kernel. We call GPULib routines to load the module and function:
  module = gpuLoadModule(ptx_source, error=err)
  kernel = gpuLoadFunction(module, 'VecAdd_kernel', error=err)
Next, we create two arrays to add and a third array to contain the result:
  n = 20L
  dx = gpuFindgen(n)
  dy = gpuFindgen(n)
  dz = gpuFltarr(n)
Lastly, we are ready to call our kernel. The grid and block dimensions are set through the GRID and BLOCKS keywords; otherwise, we just pass the kernel and its arguments to GPUEXECUTEFUNCTION:
  nThreadsPerBlock = 256L
  nBlocksPerGrid = (n + nThreadsPerBlock - 1L) / nThreadsPerBlock

  gpuExecuteFunction, kernel, $
                      dx->_getHandle(), $
                      dy->_getHandle(), $
                      dz->_getHandle(), $
                      n, $
                      GRID=[nBlocksPerGrid], $
                      BLOCKS=[nThreadsPerBlock]
There is a SHARED_MEMORY keyword to pass the amount of shared memory that your kernel requires, but that is not needed for our simple example.

Friday, May 3, 2013

Linear algebra in GPULib 1.6

GPULib 1.6 adds more than 100 linear algebra routines, available through MAGMA, a GPU-accelerated LAPACK implementation.

The IDL bindings for the low-level MAGMA routines are automatically generated wrappers for the standard LAPACK interface, but GPULib also offers a higher-level routine mimicking the LA_ routines present in the IDL library. GPUINVERT is comparable to LA_INVERT, inverting a non-singular matrix in a single call:

  dinverse = gpuInvert(da, lhs=dinverse)
While these routines document the techniques they use to perform their operations, they do not require much knowledge of those underlying techniques from the user. More high-level routines similar to the LA_ family are planned for future GPULib versions.

Calling the low-level MAGMA routines directly requires more knowledge of the mathematical algorithms, and the notation is a bit more cumbersome. For example, here is how to call the MAGMA version of SGELS:

  status = gpusgels((byte('N'))[0], $
                    m, n, nrhs, $
                    da->_getHandle(), lda, $
                    db->_getHandle(), ldb, $
                    work, lwork, $
                    info)
SGELS solves overdetermined or underdetermined real linear systems using a QR or LQ factorization of a matrix of full rank. This low-level operation has options for performing multiple solves in a call and operating on the transpose.

The calling syntax for the MAGMA routines is in the API documentation for GPULib as well as in lib/magma_routines.h in the GPULib distribution.

Wednesday, May 1, 2013

GPULib 1.6 released

Now available for download from the Tech-X website, GPULib 1.6 offers some exciting new features and a lot more stability. Here's a brief rundown of what's new:

  • All platforms (Windows, Linux, and OS X) are now distributed as binaries. No building from source required!
  • Added wrappers for MAGMA, a GPU accelerated LAPACK library. This makes over 100 low-level linear algebra routines available. In addition to these low-level routines, we have one high-level routine, GPUINVERT, that is similar to the IDL routine LA_INVERT. We intend to write analogs to the rest of the LA_ IDL routines as well. This feature requires the Basic membership plan.
  • GPULib can now load and execute custom CUDA kernels without having to link to GPULib; you just compile your kernel to a .ptx file. GPULib provides routines to load and execute that kernel at runtime. This feature requires the Basic membership plan.
  • Support for CUDA 5.0.
  • Added support for up to 8-dimensional arrays. This is basic support for higher-dimensional arrays in GPULib; not every routine understands them yet.
  • Added optimized scalar/array operations. No more tricks just to add a scalar to an array!
  • Fixed bugs in GPUHISTOGRAM and GPUHIST_2D.
  • Fixed bugs in GPUATAN2 for complex and double complex.
  • Fixed bugs in GPUREAL.

I have a few more posts planned with details and examples of some of the big features above, like the MAGMA wrappers and the custom kernels.

A lot of work was done on infrastructure to make releasing an easier process, hopefully resulting in more frequent updates in the future. We have plans for some very exciting features in the coming year!