Constant memory usage in CUDA code

I cannot figure out by myself what the best way is to ensure that the memory used in my kernel is constant memory. There is a similar question at http://stackoverflow...r-pleasant-way. I am working with a GTX 580 and compiling only for compute capability 2.0. My kernel looks like:

__global__ void Foo(const int *src, float *result) { ... }

I execute the following code on host:

int *src;
cudaMalloc(&src, size);
cudaMemcpy(src, hostSrc, size, cudaMemcpyHostToDevice);
Foo<<<...>>>(src, result);

The alternative way is to add

__constant__ int src[size];

to the .cu file, remove the src pointer from the kernel arguments, and execute

cudaMemcpyToSymbol("src", hostSrc, size, 0, cudaMemcpyHostToDevice);
Foo<<<...>>>(result);

Are these two ways equivalent, or does the first one not guarantee the use of constant memory instead of global memory? size changes dynamically, so the second way is not handy in my case.

asked Jan 30 '12 by markhor

People also ask

What is constant memory used for in CUDA?

It is used for storing data that will not change over the course of kernel execution. It supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the same location. There is a total of 64 KB of constant memory on a CUDA-capable device. The constant memory is cached.
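One way to confirm that figure on a given card is to query cudaDeviceProp::totalConstMem. A minimal sketch (device index 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0
    printf("%s: %zu bytes of constant memory\n",
           prop.name, prop.totalConstMem);  // typically 65536 (64 KB)
    return 0;
}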

What is a constant variable in CUDA?

Constant memory is a read-only cache whose contents can be broadcast to multiple threads in a block. A variable allocated in constant memory needs to be declared in CUDA using the special __constant__ identifier, and it must be a global variable, i.e. it must be declared in the scope that contains the kernel, not inside the kernel itself.
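For concreteness, a minimal sketch of such a declaration and its use (the array name, size, and kernel are illustrative, not from the question):

#include <cuda_runtime.h>

#define N 256

__constant__ float coeffs[N];               // file scope, outside any kernel

__global__ void scale(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = 2.0f * coeffs[i];          // read-only access inside the kernel
}

int main() {
    float hostCoeffs[N];
    for (int i = 0; i < N; ++i) hostCoeffs[i] = (float)i;
    cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(hostCoeffs));  // fill constant memory

    float *out;
    cudaMalloc(&out, sizeof(float) * N);
    scale<<<1, N>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}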

How do I pass arguments to a device in CUDA?

See the CUDA documentation for more precise information. Also, with Fermi-class devices you can simply allocate the memory (cudaMalloc), copy it to the device, and then pass the argument as a const pointer. The compiler will recognise if you are accessing the data uniformly across the warps and, if so, will use the constant cache.
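A sketch of that approach, assuming a Fermi-class device (the kernel and names are made up for illustration). Marking the pointer const (and, ideally, __restrict__) lets the compiler prove the data is read-only; a uniform read such as table[0] is then a candidate for the constant cache, though that remains a compiler decision rather than a guarantee:

__global__ void apply(const float * __restrict__ table,  // ordinary cudaMalloc'ed memory
                      float *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = table[0] * i;   // all threads read the same word: a
                                    // uniform access the compiler may route
                                    // through the constant cache
}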

Why do we need to get the global array factors in CUDA?

This is necessary because constant memory is defined in the CUDA code, so we need CUDA to allocate the necessary memory and then provide us with a pointer to it. By calling the method get_global we ask the CUDA subsystem to provide us with the location of a global object, in this case the array factors.
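The get_global call mentioned above belongs to a host-side wrapper; the plain CUDA runtime analogue is cudaGetSymbolAddress, which returns the device address of a module-scope object. A sketch reusing the name factors from the passage (everything else is illustrative):

#include <cuda_runtime.h>

__constant__ float factors[4];              // allocated by CUDA at module load

int main() {
    void *devPtr = NULL;
    cudaGetSymbolAddress(&devPtr, factors); // ask CUDA where factors lives

    float hostFactors[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpy(devPtr, hostFactors, sizeof(hostFactors),
               cudaMemcpyHostToDevice);     // same effect as cudaMemcpyToSymbol
    return 0;
}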


1 Answer

The second way is the only way to ensure that the array is compiled into CUDA constant memory and accessed correctly via the constant memory cache. But you should ask yourself how the contents of that array are going to be accessed within a block of threads. If every thread will access the array uniformly, then there will be a performance advantage in using constant memory, because there is a broadcast mechanism from the constant memory cache (it also saves global memory bandwidth, because constant memory is stored in off-chip DRAM and the cache reduces the DRAM transaction count). But if access is random, then there can be serialisation of access to constant memory, which will negatively affect performance.
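The two access patterns contrast like this (a sketch with made-up names; only the addressing differs):

__constant__ float lut[256];

__global__ void broadcastRead(float *out) {
    // Every thread reads the SAME address: one constant-cache access,
    // broadcast to the whole warp -- the favourable case.
    out[threadIdx.x] = lut[0];
}

__global__ void divergentRead(float *out, const int *idx) {
    // Threads read DIFFERENT addresses: the constant cache serves the
    // distinct addresses one after another -- the serialised case.
    out[threadIdx.x] = lut[idx[threadIdx.x]];
}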

Typical things which might be good fits for __constant__ memory would be model coefficients, weights, and other constant values which need to be set at runtime. On Fermi GPUs, for example, the kernel argument list is stored in constant memory. But if the contents are accessed non-uniformly, or the type or size of members isn't constant from call to call, then normal global memory is preferable.

Also keep in mind that there is a limit of 64 KB of constant memory per GPU context, so it is not practical to store very large amounts of data in constant memory. If you need a lot of read-only storage with a cache, it might be worth binding the data to a texture and seeing what the performance is like. On pre-Fermi cards it usually yields a handy performance gain; on Fermi the results can be less predictable compared to global memory, because of the improved cache layout in that architecture.
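For reference, texture binding in the Fermi era looked roughly like the sketch below. This uses the legacy texture reference API that was current at the time (it was deprecated later and removed in CUDA 12), so treat it as period code with made-up names:

texture<float, 1, cudaReadModeElementType> texRef;   // file-scope texture reference

__global__ void readViaTexture(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);              // cached read-only fetch
}

// Host side (error checking omitted):
//   cudaMalloc(&devData, n * sizeof(float));
//   cudaMemcpy(devData, hostData, n * sizeof(float), cudaMemcpyHostToDevice);
//   cudaBindTexture(NULL, texRef, devData, n * sizeof(float));
//   readViaTexture<<<blocks, threads>>>(out, n);
//   cudaUnbindTexture(texRef);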

answered Oct 14 '22 by talonmies