
CUDA: are access times for texture memory similar to coalesced global memory?

My kernel threads access a linear character array in a coalesced fashion. If I map the array to texture I don't see any speedup. The running times are almost the same. I'm working on a Tesla C2050 with compute capability 2.0 and read somewhere that global accesses are cached. Is that true? Perhaps that is why I am not seeing a difference in the running time.

The array in the main program is

char *dev_database = NULL;
cudaMalloc( (void**) &dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );

and I bind it to the texture reference texture<char> texdatabase with

cudaBindTexture(NULL, texdatabase, dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );

Each thread then reads a character ch = tex1Dfetch(texdatabase, p + id) where id is threadIdx.x + blockIdx.x * blockDim.x and p is an offset.
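For reference, the pieces described above fit together roughly like this. This is a minimal sketch, not my exact program: the kernel name, launch configuration, and output buffer are illustrative, and the JOBS/FRAGMENTSIZE values are placeholders.

```cuda
// Sketch of the setup described in the question; only dev_database and
// texdatabase are names from the original code.
#include <cuda_runtime.h>

#define JOBS 1024          // placeholder value
#define FRAGMENTSIZE 256   // placeholder value

texture<char> texdatabase;  // legacy texture reference, file scope

__global__ void readKernel(char *out, int p)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    char ch = tex1Dfetch(texdatabase, p + id);  // read through texture cache
    out[id] = ch;
}

int main()
{
    char *dev_database = NULL;
    cudaMalloc((void**)&dev_database, JOBS * FRAGMENTSIZE * sizeof(char));
    cudaBindTexture(NULL, texdatabase, dev_database,
                    JOBS * FRAGMENTSIZE * sizeof(char));

    // ... launch, e.g. readKernel<<<JOBS, FRAGMENTSIZE>>>(dev_out, p);

    cudaUnbindTexture(texdatabase);
    cudaFree(dev_database);
    return 0;
}
```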

I'm binding only once and dev_database is a large array. Actually I found that if the size is too large the bind fails. Is there a limit on the size of the array to bind? Thanks very much.

Ross asked Dec 20 '25 18:12


1 Answer

There are several possibilities for why you don't see any difference in performance, but the most likely is that this memory access is not your bottleneck. If it is not your bottleneck, making it faster will have no effect on performance.

Regarding caching: in this case, since you are reading single bytes, each warp reads 32 bytes (32 threads x 1 byte), so a 128-byte L1 cache line serves 4 consecutive warps. Assuming few cache conflicts, you will therefore get up to 4x reuse from the cache. So if this memory access is a bottleneck, the texture cache may not benefit you any more than the general-purpose L1 cache already does.

You should first determine if you are bandwidth bound and if this data access is the culprit. Once you have done that, then optimize your memory accesses. Another tactic to consider is to access 4 to 16 chars per thread per load (using a char4 or int4 struct with byte packing/unpacking) rather than one per thread to increase the number of memory transactions in flight at a time -- this can help to saturate the global memory bus.
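One way to sketch the "several chars per thread" idea, assuming the buffer is 4-byte aligned and its length is a multiple of 4 (the kernel name and output layout here are illustrative, not part of the original code):

```cuda
#include <cuda_runtime.h>

// Illustrative sketch: each thread loads 4 chars at once through a char4,
// so each coalesced transaction moves 4x as many bytes per warp.
__global__ void process4(const char4 *db, char *out, int n4)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n4) {
        char4 c = db[id];          // one 4-byte coalesced load
        // unpack and use the individual bytes
        out[4 * id + 0] = c.x;
        out[4 * id + 1] = c.y;
        out[4 * id + 2] = c.z;
        out[4 * id + 3] = c.w;
    }
}

// Launch with n4 = totalChars / 4 and the device pointer cast to char4*:
//   process4<<<(n4 + 255) / 256, 256>>>((const char4 *)dev_database,
//                                       dev_out, n4);
```

The same pattern extends to int4 (16 bytes per thread) with manual byte packing/unpacking, trading a little register pressure for more memory transactions in flight.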

There is a good presentation by Paulius Micikevicius from GTC 2010 that you might want to watch. It covers both analysis-driven optimization and the specific concept of memory transactions in flight.

harrism answered Dec 22 '25 22:12


