
The variation of cache misses in GPU

Tags: gpgpu, gpu, opencl
I have been toying with an OpenCL kernel that accesses 7 global memory buffers, does something with the values, and stores the result in an 8th global memory buffer. As I observed, the L1 cache miss ratio (= misses / (misses + hits)) varies a lot as the input size increases, and I can't find the source of this variation. The input size here means the number of global work-items (a power of 2, and a multiple of the workgroup size). The workgroup size remains 256.

These are the results; they show the L1 cache miss ratio, starting from 4096 work-items (16 workgroups).

0.677125
0.55946875
0.345994792
0.054078125
0.436167969
0.431871745
0.938546224
0.959258789
0.952941406
0.955016479

The profiler says it uses 18 registers per thread. Here is the code (the function TTsum() is supposed to do just a bunch of dependent transcendental operations, so I guess it has nothing to do with caches):

float TTsum(float x1, float x2, float x3, float x4, float x5, float x6, float x7)
{
        float temp = 0;
        for (int j = 0; j < 2; j++)
                temp = temp +  x1 + (float)x2 + x3 + x4 + x5 + x6 + x7;
        temp = sqrt(temp);
        temp = exp(temp);
        temp = temp / x1;
        temp = temp / (float)x2;
        for (int j = 0; j < 20; j++) temp = sqrt(temp);
        return temp;
}

__kernel void histogram(__global float* x1,
                        __global int* x2,
                        __global float* x3,
                        __global float* x4,
                        __global float* x5,
                        __global float* x6,
                        __global float* x7,
                        __global float* y)
{
  int id = get_global_id(0);    
  for (int j = 0; j < 1000; j++)
    y[id] = TTsum(x1[id], x2[id], x3[id], x4[id], x5[id], x6[id], x7[id]);
}

Can someone explain the cache behavior? The experiments were done on a GTX 580.

Zk1001, asked Jul 19 '11 14:07



1 Answer

It's quite hard to calculate histograms in CUDA. I believe the random access on y[] may very well be the cause of the behaviour you observe. If you haven't already, read this: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf

whoplisp, answered Oct 05 '22 19:10