Cuda - OpenCL CPU 4x faster than OpenCL or CUDA GPU version

Tags: c#, cuda, opencl, cudafy.net

A wave simulator I've been working on with C# + Cudafy (C# -> CUDA or OpenCL translator) works great, except for the fact that running the OpenCL CPU version (Intel driver, 15" MacBook Pro Retina i7 2.7GHz, GeForce 650M (Kepler, 384 cores)) is roughly four times as fast as the GPU version.

(This happens whether I use the CL or CUDA GPU backend. The OpenCL GPU and CUDA versions perform nearly identically.)

To clarify, for a sample problem:

- OpenCL CPU: 1200 Hz
- OpenCL GPU: 320 Hz
- CUDA GPU: ~330 Hz

I'm at a loss to explain why the CPU version would be faster than the GPU. In this case, the kernel code that's executing (in the CL case) on the CPU and GPU is identical. I select either the CPU or GPU device during initialization, but beyond that, everything is identical.

Edit

Here's the C# code that launches one of the kernels. (The others are very similar.)

    public override void UpdateEz(Source source, float Time, float ca, float cb)
    {
        var blockSize = new dim3(1);
        var gridSize = new dim3(_gpuEz.Field.GetLength(0),_gpuEz.Field.GetLength(1));

        Gpu.Launch(gridSize, blockSize)
            .CudaUpdateEz(
                Time
                , ca
                , cb
                , source.Position.X
                , source.Position.Y
                , source.Value
                , _gpuHx.Field
                , _gpuHy.Field
                , _gpuEz.Field
            );

    }

And, here's the relevant CUDA kernel function generated by Cudafy:

    extern "C" __global__ void CudaUpdateEz(float time, float ca, float cb, int sourceX, int sourceY, float sourceValue,
        float* hx, int hxLen0, int hxLen1,
        float* hy, int hyLen0, int hyLen1,
        float* ez, int ezLen0, int ezLen1)
    {
        int x = blockIdx.x;
        int y = blockIdx.y;
        if (x > 0 && x < ezLen0 - 1 && y > 0 && y < ezLen1 - 1)
        {
            ez[(x) * ezLen1 + ( y)] = ca * ez[(x) * ezLen1 + ( y)] + cb * (hy[(x) * hyLen1 + ( y)] - hy[(x - 1) * hyLen1 + ( y)]) - cb * (hx[(x) * hxLen1 + ( y)] - hx[(x) * hxLen1 + ( y - 1)]);
        }
        if (x == sourceX && y == sourceY)
        {
            ez[(x) * ezLen1 + ( y)] += sourceValue;
        }
    }

Just for completeness, here's the C# that is used to generate the CUDA:

    [Cudafy]
    public static void CudaUpdateEz(
        GThread thread
        , float time
        , float ca
        , float cb
        , int sourceX
        , int sourceY
        , float sourceValue
        , float[,] hx
        , float[,] hy
        , float[,] ez
        )
    {
        var i = thread.blockIdx.x;
        var j = thread.blockIdx.y;

        if (i > 0 && i < ez.GetLength(0) - 1 && j > 0 && j < ez.GetLength(1) - 1)
            ez[i, j] =
                ca * ez[i, j]
                +
                cb * (hy[i, j] - hy[i - 1, j])
                -
                cb * (hx[i, j] - hx[i, j - 1])
                ;

        if (i == sourceX && j == sourceY)
            ez[i, j] += sourceValue;
    }

Obviously, the if in this kernel is bad, but even the resulting pipeline stall shouldn't cause such an extreme performance delta.

The only other thing that jumps out at me is that I'm using a lame grid/block allocation scheme, i.e., the grid is the size of the array to be updated, and each block is one thread. I'm sure this has some impact on performance, but I can't see it making the GPU run at a quarter of the speed of the CL code running on the CPU. ARGH!

asked May 07 '13 23:05

3Dave

1 Answer

Answering this to get it off the unanswered list.

The code posted indicates that the kernel launch is specifying a threadblock of 1 (active) thread. This is not the way to write fast GPU code, as it will leave most of the GPU capability idle.

Threadblocks should typically contain at least 128 threads, in multiples of 32. Larger is often better, up to the per-block limit of 512 or 1024 threads, depending on the GPU.
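The per-block limit and the warp size can be read from the device rather than assumed. The following is a minimal sketch using the CUDA runtime's `cudaGetDeviceProperties`; device index 0, the candidate block size of 256, and the helper name `blockSizeLooksReasonable` are illustrative assumptions, not anything taken from the question's code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the block size is a whole number of warps
// and does not exceed the device's per-block thread limit.
static bool blockSizeLooksReasonable(int threads, int warpSize, int maxThreadsPerBlock)
{
    return threads > 0 && threads % warpSize == 0 && threads <= maxThreadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0); // device 0 assumed
    if (err != cudaSuccess)
    {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const int candidate = 256; // illustrative block size, not a tuned value
    printf("%s: warp size %d, max threads per block %d\n",
           prop.name, prop.warpSize, prop.maxThreadsPerBlock);
    printf("%d threads per block is %s\n", candidate,
           blockSizeLooksReasonable(candidate, prop.warpSize, prop.maxThreadsPerBlock)
               ? "a whole number of warps within the limit"
               : "not a good fit for this device");
    return 0;
}
```

Passing this check only rules out obviously poor choices; the best block size for a given kernel still has to be found by measurement.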

The GPU "likes" to hide latency by having a lot of parallel work "available". Specifying more threads per block assists with this goal. (Having a reasonably large number of threadblocks in the grid may also help.)
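As a rough sketch of what this could look like for the kernel in the question: the version below gives each block 32 x 8 = 256 threads and computes the cell index from `blockIdx`, `blockDim` and `threadIdx` together. The kernel name `UpdateEzTiled`, the helper `divUp`, the 1000 x 1000 field, and the assumption that `hx`, `hy` and `ez` all share the same dimensions are mine, not the original poster's. Mapping `threadIdx.x` to the contiguous index is a design choice of this sketch, not something the answer above prescribes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same arithmetic as the question's CudaUpdateEz, but one thread per cell
// inside multi-thread blocks. Assumes hx, hy and ez share the dimensions
// len0 x len1 (row-major, second index contiguous).
__global__ void UpdateEzTiled(float ca, float cb,
                              int sourceX, int sourceY, float sourceValue,
                              const float* hx, const float* hy, float* ez,
                              int len0, int len1)
{
    // threadIdx.x follows the contiguous index so a warp touches adjacent floats.
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    int x = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up, so threads past the edge must do nothing.
    if (x >= len0 || y >= len1) return;

    int i = x * len1 + y;
    if (x > 0 && x < len0 - 1 && y > 0 && y < len1 - 1)
    {
        ez[i] = ca * ez[i]
              + cb * (hy[i] - hy[i - len1])  // hy[x, y] - hy[x - 1, y]
              - cb * (hx[i] - hx[i - 1]);    // hx[x, y] - hx[x, y - 1]
    }
    if (x == sourceX && y == sourceY)
    {
        ez[i] += sourceValue;
    }
}

// Number of blocks of size d needed to cover n cells.
static unsigned int divUp(unsigned int n, unsigned int d) { return (n + d - 1) / d; }

// Print a message and return false if a runtime call failed.
static bool ok(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) printf("%s: %s\n", what, cudaGetErrorString(err));
    return err == cudaSuccess;
}

int main()
{
    const int len0 = 1000, len1 = 1000; // illustrative field size
    const size_t bytes = sizeof(float) * len0 * len1;

    float *hx = nullptr, *hy = nullptr, *ez = nullptr;
    if (!ok(cudaMalloc(&hx, bytes), "cudaMalloc hx")) return 1;
    if (!ok(cudaMalloc(&hy, bytes), "cudaMalloc hy")) return 1;
    if (!ok(cudaMalloc(&ez, bytes), "cudaMalloc ez")) return 1;
    cudaMemset(hx, 0, bytes);
    cudaMemset(hy, 0, bytes);
    cudaMemset(ez, 0, bytes);

    dim3 block(32, 8); // 256 threads per block = 8 full warps
    dim3 grid(divUp(len1, block.x), divUp(len0, block.y));

    UpdateEzTiled<<<grid, block>>>(1.0f, 0.5f, len0 / 2, len1 / 2, 1.0f,
                                   hx, hy, ez, len0, len1);
    bool launched = ok(cudaGetLastError(), "launch") &&
                    ok(cudaDeviceSynchronize(), "kernel");

    cudaFree(hx);
    cudaFree(hy);
    cudaFree(ez);
    return launched ? 0 : 1;
}
```

With Cudafy, the same change should be expressible in the C# source by combining `thread.blockIdx`, `thread.blockDim` and `thread.threadIdx` in the `[Cudafy]` method and passing a multi-thread `dim3` as the block size to `Gpu.Launch`. Note that the grid is then the array size divided by the block size (rounded up), not the array size itself.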

Furthermore, the GPU executes threads in groups of 32 (warps). Specifying only 1 thread per block, or any non-multiple of 32, leaves idle execution slots in every threadblock that gets executed. One thread per block is particularly bad: 31 of the 32 slots in each warp do nothing.
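The cost of a poorly chosen block size can be put in numbers. The sketch below is host-only code that computes the fraction of execution slots carrying a real thread for a few block sizes; the function name `laneUtilization` and the hard-coded warp size of 32 are assumptions made for the illustration.

```cuda
#include <cstdio>

// Warp size assumed to be 32, as stated in the answer above.
constexpr int kWarpSize = 32;

// Fraction of warp slots occupied by real threads for a given block size.
// Warps are allocated whole, so a partly filled warp still costs 32 slots.
static double laneUtilization(int threadsPerBlock)
{
    int warps = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    return static_cast<double>(threadsPerBlock) / (warps * kWarpSize);
}

int main()
{
    const int sizes[] = {1, 48, 128, 256};
    for (int t : sizes)
    {
        printf("%4d threads per block: %.4f of slots used\n", t, laneUtilization(t));
    }
    return 0;
}
```

For 1 thread per block the fraction is 1/32, or about 3%; for 48 threads it is 48/64 = 75%; for 128 and 256 threads every slot is used. This counts only slots within a warp and says nothing about latency hiding, which is a separate effect.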

answered Oct 01 '22 01:10

Robert Crovella
