A wave simulator I've been working on with C# + Cudafy (a C# -> CUDA or OpenCL translator) works great, except that the OpenCL CPU version (Intel driver; 15" MacBook Pro Retina, i7 2.7 GHz, GeForce 650M (Kepler, 384 cores)) runs roughly four times as fast as the GPU version.
(This happens whether I use the CL or CUDA GPU backend. The OpenCL GPU and CUDA versions perform nearly identically.)
To clarify: for the same sample problem, I'm at a loss to explain why the CPU version would be faster than the GPU. The kernel code that's executing (in the CL case) on the CPU and the GPU is identical. I select either the CPU or the GPU device during initialization, but beyond that, everything is identical.
Edit
Here's the C# code that launches one of the kernels. (The others are very similar.)
public override void UpdateEz(Source source, float Time, float ca, float cb)
{
    // Grid covers the full Ez field; each block contains a single thread.
    var blockSize = new dim3(1);
    var gridSize = new dim3(_gpuEz.Field.GetLength(0), _gpuEz.Field.GetLength(1));

    Gpu.Launch(gridSize, blockSize)
       .CudaUpdateEz(
           Time
           , ca
           , cb
           , source.Position.X
           , source.Position.Y
           , source.Value
           , _gpuHx.Field
           , _gpuHy.Field
           , _gpuEz.Field
       );
}
And, here's the relevant CUDA kernel function generated by Cudafy:
extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue, float* hx, int hxLen0, int hxLen1, float* hy, int hyLen0, int hyLen1, float* ez, int ezLen0, int ezLen1)
{
int x = blockIdx.x;
int y = blockIdx.y;
if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1)
{
ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]);
}
if (x == sourceX && y == sourceY)
{
ez[(x) * ezLen1 + ( y)] += sourceValue;
}
}
Just for completeness, here's the C# that is used to generate the CUDA:
[Cudafy]
public static void CudaUpdateEz(
    GThread thread
    , float time
    , float ca
    , float cb
    , int sourceX
    , int sourceY
    , float sourceValue
    , float[,] hx
    , float[,] hy
    , float[,] ez
    )
{
    var i = thread.blockIdx.x;
    var j = thread.blockIdx.y;

    if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
        ez[i, j] =
            ca * ez[i, j]
            + cb * (hy[i, j] - hy[i - 1, j])
            - cb * (hx[i, j] - hx[i, j - 1])
            ;

    if (i == sourceX && j == sourceY)
        ez[i, j] += sourceValue;
}
Obviously, the if in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.
The only other thing that jumps out at me is that I'm using a lame grid/block allocation scheme - i.e., the grid is the size of the array to be updated, and each block is one thread. I'm sure this has some impact on performance, but I can't see it making the GPU run at a quarter of the speed of the CL code on the CPU. ARGH!
Answering this to get it off the unanswered list.
The code posted indicates that the kernel launch is specifying a threadblock of 1 (active) thread. This is not the way to write fast GPU code, as it will leave most of the GPU capability idle.
Typical threadblock sizes should be at least 128 threads per block (and higher is often better), in multiples of 32, up to the limit of 512 or 1024 threads per block, depending on the GPU.
The GPU "likes" to hide latency by having a lot of parallel work "available". Specifying more threads per block assists with this goal. (Having a reasonably large number of threadblocks in the grid may also help.)
Furthermore, the GPU executes threads in groups of 32 (warps). Specifying only 1 thread per block, or a non-multiple of 32, will leave idle execution slots in every threadblock that gets executed; 1 thread per block is particularly bad.
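Purely as an illustration (not from the original answer), here is a minimal sketch of what that advice might look like with the Cudafy API used in the question, assuming an arbitrary 16x16 thread block (256 threads, a multiple of 32) and a grid rounded up to cover the whole field; the kernel switches from per-block to per-thread indexing accordingly:

public override void UpdateEz(Source source, float Time, float ca, float cb)
{
    const int BlockDim = 16; // 16 x 16 = 256 threads per block, a multiple of 32
    int sizeX = _gpuEz.Field.GetLength(0);
    int sizeY = _gpuEz.Field.GetLength(1);

    var blockSize = new dim3(BlockDim, BlockDim);
    // Round the grid up so every cell of the field is covered.
    var gridSize = new dim3((sizeX + BlockDim - 1) / BlockDim,
                            (sizeY + BlockDim - 1) / BlockDim);

    Gpu.Launch(gridSize, blockSize)
       .CudaUpdateEz(Time, ca, cb,
                     source.Position.X, source.Position.Y, source.Value,
                     _gpuHx.Field, _gpuHy.Field, _gpuEz.Field);
}

[Cudafy]
public static void CudaUpdateEz(GThread thread, float time, float ca, float cb,
    int sourceX, int sourceY, float sourceValue,
    float[,] hx, float[,] hy, float[,] ez)
{
    // Global cell index = block offset + thread offset within the block.
    int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    int j = thread.blockIdx.y * thread.blockDim.y + thread.threadIdx.y;

    // Threads in the rounded-up grid can fall outside the field; the interior
    // bounds check below also excludes them.
    if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
        ez[i, j] = ca * ez[i, j]
                 + cb * (hy[i, j] - hy[i - 1, j])
                 - cb * (hx[i, j] - hx[i, j - 1]);

    if (i == sourceX && j == sourceY)
        ez[i, j] += sourceValue;
}

Whether this alone closes the 4x gap would have to be measured, but it at least gives each multiprocessor enough warps to schedule while memory accesses are in flight.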