I have been toying with an OpenCL kernel that reads 7 global memory buffers, does some computation on the values, and stores the result in an 8th global memory buffer. I observed that as the input size increases, the L1 cache miss ratio (= misses / (misses + hits)) varies a lot, and I can't find the source of this variation. The input size here means the number of global work-items (a power of 2, and a multiple of the workgroup size). The workgroup size stays at 256.
Here are the results: the L1 cache miss ratio for increasing input sizes, starting from 4096 work-items (16 workgroups).
0.677125
0.55946875
0.345994792
0.054078125
0.436167969
0.431871745
0.938546224
0.959258789
0.952941406
0.955016479
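For clarity, the ratio reported above is computed from the two profiler counters as in the small C sketch below. The function name and parameters are illustrative only (they are not profiler API names), and returning 0 when no accesses were counted is a convention chosen for this sketch.

```c
#include <stdint.h>

/*
 * L1 miss ratio = misses / (misses + hits).
 * `misses` and `hits` stand for the raw counter values read from the
 * profiler; the names are hypothetical, not actual counter identifiers.
 */
static double miss_ratio(uint64_t misses, uint64_t hits)
{
    uint64_t total = misses + hits;

    /* Avoid dividing by zero when no cached accesses were recorded. */
    if (total == 0)
        return 0.0;

    return (double)misses / (double)total;
}
```

For example, 300 misses and 100 hits give a ratio of 0.75.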
The profiler reports 18 registers per thread. Here is the code (the function TTsum() is just meant to do a bunch of dependent transcendental operations, so I guess it has nothing to do with the caches):
float TTsum(float x1, float x2, float x3, float x4, float x5, float x6, float x7)
{
    float temp = 0;
    for (int j = 0; j < 2; j++)
        temp = temp + x1 + (float)x2 + x3 + x4 + x5 + x6 + x7;
    temp = sqrt(temp);
    temp = exp(temp);
    temp = temp / x1;
    temp = temp / (float)x2;
    for (int j = 0; j < 20; j++)
        temp = sqrt(temp);
    return temp;
}

__kernel void histogram(__global float* x1,
                        __global int* x2,
                        __global float* x3,
                        __global float* x4,
                        __global float* x5,
                        __global float* x6,
                        __global float* x7,
                        __global float* y)
{
    int id = get_global_id(0);
    for (int j = 0; j < 1000; j++)
        y[id] = TTsum(x1[id], x2[id], x3[id], x4[id], x5[id], x6[id], x7[id]);
}
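As a rough sanity check, the amount of global data this kernel touches can be compared with the cache capacity. The C sketch below assumes 8 buffers of 4-byte elements (float and int), one element per buffer per work-item, as in the kernel above. The helper names are made up for illustration, and the cache capacity is left as a parameter because it should come from the device documentation or a query such as clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, not from a guess.

```c
#include <stddef.h>

/* Assumptions read off the kernel above: 8 global buffers, 4 bytes per element. */
enum { NUM_BUFFERS = 8, BYTES_PER_ELEMENT = 4 };

/* Total bytes of global memory touched when each work-item accesses
 * exactly one element of every buffer. */
static size_t footprint_bytes(size_t work_items)
{
    return work_items * NUM_BUFFERS * BYTES_PER_ELEMENT;
}

/* Returns 1 if the footprint fits in a cache of `cache_bytes`, else 0.
 * `cache_bytes` must be supplied by the caller for the actual device. */
static int fits_in_cache(size_t footprint, size_t cache_bytes)
{
    return footprint <= cache_bytes;
}
```

With these assumptions, 4096 work-items touch 4096 * 8 * 4 = 131072 bytes (128 KB), and a single workgroup of 256 work-items touches 8192 bytes (8 KB). This only bounds the working set; it does not model cache lines, associativity, or how workgroups are scheduled.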
Can someone explain the cache behavior? The experiments were run on a GTX 580.
What Is a Cache Miss? A cache miss is an event in which a system or application makes a request to retrieve data from a cache, but that specific data is not currently in cache memory. Contrast this to a cache hit, in which the requested data is successfully retrieved from the cache.
GPUs have an L2 cache, just as CPUs do. As with main memory, the L2 cache on a GPU is typically much smaller than the L2 or L3 cache on a CPU.
It's quite hard to calculate histograms in CUDA. I believe the random access on y[] may very well be the cause for the behaviour you observe. Maybe read this if you haven't: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/histogram256/doc/histogram.pdf