
Yet Another CUDA Texture Memory Thread. (Why should texture memory be faster on Fermi?)

There are quite a few Stack Overflow threads asking why a kernel that uses textures is no faster than one that uses plain global memory accesses. The answers and comments have always seemed a little esoteric to me.

The NVIDIA white paper on the Fermi architecture states plainly:

The Fermi architecture addresses this challenge by implementing a single unified memory request path for loads and stores, with an L1 cache per SM multiprocessor and unified L2 cache that services all operations (load, store and texture).

So why on earth should one expect any speed-up from using texture memory on Fermi devices, when every memory fetch (regardless of whether it goes through a texture or not) is served by the same L2 cache? In fact, in most cases direct access to global memory should be faster, since it is also cached in L1, which a texture fetch is not. This is also reported in a few related questions here on Stack Overflow.

Can someone confirm this or show me what I'm missing?

asked Dec 11 '22 by betapatch


1 Answer

You are neglecting that each Streaming Multiprocessor has a texture cache (see the picture below illustrating a Streaming Multiprocessor for Fermi).

[Figure: block diagram of a Fermi Streaming Multiprocessor, showing its dedicated texture cache]

The texture cache serves a different purpose than the L1/L2 caches, since it is optimized for spatial data locality. Such locality arises whenever data belonging to semantically (not necessarily physically) neighboring points of a regular, Cartesian 1D, 2D or 3D grid must be accessed. To better explain this concept, consider the following figure illustrating the stencil involved in 2D or 3D finite-difference calculations:

[Figure: 2D/3D finite-difference stencil; the red point is updated from the surrounding blue points]

Calculating the finite difference at the red point involves accessing the data associated with the blue points. These data are not physical neighbors of the red point, since they will not be stored consecutively in global memory once the 2D or 3D array is flattened to 1D. However, they are semantic neighbors of the red point, and texture memory is very good at caching such values. On the other hand, the L1/L2 caches are good when the same datum, or its physical neighbors, must be accessed frequently.
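To make the idea concrete, here is a minimal sketch (not anyone's actual production code) of a 5-point Laplacian stencil whose neighbor reads go through the texture path, written against the Fermi-era texture reference API (since removed from recent CUDA toolkits); names such as texField, NX and NY are purely illustrative:

```cuda
// Minimal sketch, assuming a float field of size NX x NY stored in a cudaArray
// and the Fermi-era texture *reference* API. All names here are illustrative.
#include <cuda_runtime.h>

#define NX 256
#define NY 256

// Texture reference through which the kernel reads the 2D field.
texture<float, cudaTextureType2D, cudaReadModeElementType> texField;

// 5-point Laplacian: the value at the "red" point is computed from its four
// "blue" neighbors, which are fetched through the texture cache.
__global__ void laplacian_tex(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < NX - 1 && j > 0 && j < NY - 1) {
        // +0.5f addresses texel centers (point filtering is the default).
        float x = i + 0.5f, y = j + 0.5f;
        float c = tex2D(texField, x,        y);
        float n = tex2D(texField, x,        y + 1.0f);
        float s = tex2D(texField, x,        y - 1.0f);
        float e = tex2D(texField, x + 1.0f, y);
        float w = tex2D(texField, x - 1.0f, y);
        out[j * NX + i] = n + s + e + w - 4.0f * c;
    }
}

int main()
{
    // Host field with some arbitrary contents.
    float *h_in = new float[NX * NY];
    for (int k = 0; k < NX * NY; ++k) h_in[k] = (float)k;

    // Copy the field into a cudaArray and bind the texture reference to it.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *d_array;
    cudaMallocArray(&d_array, &desc, NX, NY);
    cudaMemcpy2DToArray(d_array, 0, 0, h_in, NX * sizeof(float),
                        NX * sizeof(float), NY, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texField, d_array, desc);

    float *d_out;
    cudaMalloc(&d_out, NX * NY * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((NX + block.x - 1) / block.x, (NY + block.y - 1) / block.y);
    laplacian_tex<<<grid, block>>>(d_out);
    cudaDeviceSynchronize();

    cudaUnbindTexture(texField);
    cudaFreeArray(d_array);
    cudaFree(d_out);
    delete[] h_in;
    return 0;
}
```

The point is that each thread's four neighbor fetches are close together in the texture cache's 2D layout, even though they are far apart in the flattened linear array.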

The other side of the coin is that the texture cache has a higher latency than the L1/L2 caches, so in some cases not using textures may not lead to a significant loss of performance, simply thanks to the L1/L2 caching mechanism. From this point of view, textures were of primary importance in the early CUDA architectures, where global memory reads were not cached. But, as demonstrated in Is 1D texture memory access faster than 1D global memory access?, texture memory is still worth using on Fermi.
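As a rough illustration of the kind of comparison made in that linked question (not its actual code), the sketch below reads the same shifted 1D array once through tex1Dfetch and once through a plain global load, so the two paths can be timed against each other; n, the shift and the kernel names are made up for the example:

```cuda
// Minimal sketch of such a comparison, again with the Fermi-era texture
// reference API; n, the shift and the kernel names are illustrative.
#include <cuda_runtime.h>

texture<float, cudaTextureType1D, cudaReadModeElementType> texIn;

// Reads a shifted element through the texture path.
__global__ void copy_tex(float *out, int shift, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(texIn, (i + shift) % n);
}

// Same access pattern through the ordinary global-memory path (L1/L2).
__global__ void copy_glob(const float *in, float *out, int shift, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i + shift) % n];
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));
    cudaBindTexture(0, texIn, d_in, n * sizeof(float));

    // Time each kernel (e.g. with cudaEvent_t) for different shifts to compare
    // the texture path against the L1/L2 path on misaligned accesses.
    copy_tex <<<(n + 255) / 256, 256>>>(d_out, 1, n);
    copy_glob<<<(n + 255) / 256, 256>>>(d_in, d_out, 1, n);
    cudaDeviceSynchronize();

    cudaUnbindTexture(texIn);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```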

answered Feb 15 '23 by Vitality