I recently replaced the surface reference in my algorithm with a surface object, and noticed that the program now runs slower.
Here is a comparison for a simple example in which I fill a 3D float array [400*400*400] with a constant value.
Surface reference, time: 9.068928 ms
surface<void, cudaSurfaceType3D> s_volumeSurf;
...
surf3Dwrite(value, s_volumeSurf, px*sizeof(float), py, pz, cudaBoundaryModeTrap);
Surface object, time: 14.960256 ms
cudaSurfaceObject_t l_volSurfObj;
...
surf3Dwrite(value, l_volSurfObj, px*sizeof(float), py, pz, cudaBoundaryModeTrap);
This was tested on a GTX 680 with Compute Capability 3.0 and CUDA 5.0.
Does anyone have an explanation for this difference?
In the surface object case, the surface descriptor is fetched from global memory at run time. In the surface reference case, the descriptor is compiled into constant memory. Fetching it from constant memory can be much faster than a global memory access. If your kernel is small, so that the descriptor fetch is a large share of its work, or if the L1 cache is disabled, you could observe a significant performance loss.
You can diff the SASS code to see the difference.
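To have something concrete to time and disassemble, here is a minimal sketch of the surface object variant of the fill test. The names `fillSurfObj`, `CHECK`, and `fill.cu`, the 8x8x8 block size, and the event-based timing are my own assumptions, not the original poster's code. The surface reference variant would differ only in using the file-scope `surface<void, cudaSurfaceType3D>` declaration and `cudaBindSurfaceToArray` in place of the surface object; note that surface references are no longer available in recent CUDA toolkits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Writes a constant into every voxel of an n*n*n float volume through a
// surface object that is passed as an ordinary kernel argument.
__global__ void fillSurfObj(cudaSurfaceObject_t surf, float value, int n)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    int pz = blockIdx.z * blockDim.z + threadIdx.z;
    if (px < n && py < n && pz < n) {
        // The x coordinate of a surface write is a byte offset.
        surf3Dwrite(value, surf, px * sizeof(float), py, pz,
                    cudaBoundaryModeTrap);
    }
}

int main()
{
    const int n = 400;  // volume size from the question

    // 3D CUDA array that may be bound to a surface.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    CHECK(cudaMalloc3DArray(&arr, &desc, make_cudaExtent(n, n, n),
                            cudaArraySurfaceLoadStore));

    // Create the surface object that refers to the array.
    cudaResourceDesc res;
    memset(&res, 0, sizeof(res));
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaSurfaceObject_t surf = 0;
    CHECK(cudaCreateSurfaceObject(&surf, &res));

    dim3 block(8, 8, 8);  // assumed block size
    dim3 grid((n + block.x - 1) / block.x,
              (n + block.y - 1) / block.y,
              (n + block.z - 1) / block.z);

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up launch so one-time initialization is not part of the timing.
    fillSurfObj<<<grid, block>>>(surf, 1.0f, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaEventRecord(start));
    fillSurfObj<<<grid, block>>>(surf, 1.0f, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Time: %f ms\n", ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaDestroySurfaceObject(surf));
    CHECK(cudaFreeArray(arr));
    return 0;
}
```

Assuming the file is saved as `fill.cu`, compiling it with `nvcc` for your GPU's architecture and running `cuobjdump -sass` on the resulting binary prints the SASS. Doing the same for a surface reference build and diffing the two listings should show how each kernel obtains the surface descriptor.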