Monday, December 10, 2012

AGU 2012 poster

Here is the abstract for my poster at AGU this year, “GPU accelerated curve fitting with IDL”:

Curve fitting is a common mathematical calculation done in all scientific areas. The Interactive Data Language (IDL) is also widely used in this community for data analysis and visualization. We are creating a general-purpose, GPU accelerated curve fitting library for use from within IDL.

We have developed GPULib, a library of routines in IDL for accelerating common scientific operations including arithmetic, FFTs, interpolation, and others. These routines are accelerated using modern GPUs using NVIDIA’s CUDA architecture. We will add curve fitting routines to the GPULib library suite, making curve fitting much faster.

In addition, library routines required for efficient curve fitting will also be generally useful to other users of GPULib. In particular, a GPU accelerated LAPACK implementation such as MAGMA is required for the Levenberg- Marquardt curve fitting and is commonly used in many other scientific computations. Furthermore, the ability to evaluate custom expressions at runtime necessary for specifying a function model will be useful for users in all areas.

The poster is available to download on the AGU website.

Monday, August 27, 2012

Experiments with OpenCL

I've been looking at what it would take to convert GPULib from CUDA to OpenCL. In the effort, I have created a simple DLM that provides some basic arithmetic operations and some more advanced features. For example, we can create variables on the GPU in standard ways:

IDL> dx = cl_findgen(10)
IDL> dresult = cl_fltarr(10)
We can also transfer data from the CPU to the GPU:
IDL> hy = 2.0 * findgen(10)
IDL> dy = cl_putvar(hy)
Performing a simple operation is easy (though I will eventually change the form of this call to return the value of the result):
IDL> status = cl_add(dx, dy, dresult)
Retrieve the result:
IDL> result = cl_getvar(dresult)
I have implemented memory operations for allocating, transferring data, and freeing memory. I also have routines for obtaining metadata on variables, as well as the available OpenCL devices and platforms.

OpenCL kernels are always constructed from strings, so it is easy to make custom kernels on-the-fly to evaluate user entered expressions. In the next example we will evaluate the expression:

dz = 2. * dx + sin(dy)
First, allocate and initialize the inputs/outputs:
dx = cl_findgen(10)
dy = cl_putvar(2. * findgen(n))
dz = cl_fltarr(10, /nozero)
Next, create the kernel using a string representing the expression, a string array of the variable names in the expression, and an array of type codes of the variables in the expression:
kernel = cl_compile('z[i] = 2. * x[i] + sin(y[i])', $
                    ['x', 'y', 'z'], $
                    lonarr(3) + 4L, $
The SIMPLE keyword indicates that we are just filling in an expression in a simple element-wise kernel. To run the kernel on our data, call CL_EXECUTE with the kernel identifier and a structure containing the values for the variables in the expression:
status = cl_execute(kernel, { x: dx, y: dy, z: dz })
Finally, retrieve the results as before:
z = cl_getvar(dz)
Without the SIMPLE keyword set, the user can specify the entire full code for the kernel.

There are some great advantages to using OpenCL:

  1. OpenCL is an open standard.
  2. OpenCL is not limited to NVIDIA GPUs (or even GPUs at all, OpenCL can run on various platforms including CPUs).
  3. It is easy to create on-the-fly kernels.
  4. OpenCL uses the same concepts as CUDA, so many kernels can be translated fairly easily.
Unfortunately, there is one big disadvantage right now:
  1. OpenCL libraries are not as far along as CUDA libraries, both those provided by NVIDIA and third-parties.
This would mean that anything beyond basic arithmetic in the current GPULib would be eliminated right now. I don't think it will be long before OpenCL catches up in this regard, but this is a serious issue currently.

Thursday, February 23, 2012

Multiplying a scalar by an array

Currently, you have to be a bit tricky to multiply a scalar by an array. (In the next version of GPULib, there will be a efficient, direct way to do scalar/array operations.)

Most GPULib operations have a short form and a long form. The short form is a simple version of the operation that takes two array arguments. For addition, the short form is equivalent to:

C = A + B
where A, B, and C are all arrays of the same size. In code, this is done with:
c = gpuadd(a, b)
The long form takes five arguments, three scalars and two arrays, in the form of:
C = s1*A + s2*B + s3
where A, B, and C are arrays of the same size and s1, s2, and s3 are scalars. In code, this is:
c = gpuadd(s1, a, s2, b, s3)
If you just want to do a scalar times an array, use the long form with A for both A and B, while also setting b and c to 0.0. For example, to do 2. * findgen(10), do:
IDL> gpuinit
Welcome to GPULib 1.5.0 (Revision: 2199M)
Graphics card: Tesla C2070, compute capability: 2.0, memory: 751 MB available, 1279 MB total
Checking GPU memory errors
IDL> a = findgen(10)
IDL> da = gpuputarr(a)
IDL> dc = gpufltarr(10)
IDL> scalar = 2.0
IDL> ; dc = scalar * da + 0. * da + 0.
IDL> dc = gpuadd(scalar, da, 0., da, 0., lhs=dc) 
IDL> c = gpugetarr(dc)
IDL> print, c
      0.00000      2.00000      4.00000      6.00000      8.00000
      10.0000      12.0000      14.0000      16.0000      18.0000
Note, that most operations have a long form version that performs an affine transform along with the operation.