 

Global variable in CUDA

Tags:

cuda

How can I create global variables in CUDA? Could you please give me an example?

How can I create arrays inside a CUDA function? For example:

__global__ void test()
{
  int *a = new int[10];
}

Or how can I create a global array and access it from within a kernel? For example:

__device__ int *a;
__global__ void test()
{
  a[0] = 2;
}

Or how can I use something like the following?

__global__ void ProcessData(int img)
{
   int *neighbourhood = new int[8];
   getNeighbourhood(img, neighbourhood);
}

I still have a problem with this. I found that, compared to

__device__

if I define

"__device__ __constant__" (read only)

then memory access will improve. But my problem is that I have an array in host memory, say

 float *arr = new float[sizeOfTheArray]; 

I want to make it a writable array on the device, modify it in device memory, and then copy it back to the host. How can I do that?

user570593 asked Jun 06 '11 16:06



1 Answer

The C++ new operator is supported on compute capability 2.0 and 2.1 (i.e. Fermi) devices with CUDA 4.0, so you could use new inside a kernel to allocate global memory, although neither of your first two code snippets is how it would be done in practice.
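As a minimal sketch of in-kernel allocation (assuming a compute capability 2.0+ device and a CUDA 4.0 or later toolkit), note that memory allocated with new in device code comes from the device heap and must be freed with delete in device code:

```
__global__ void test()
{
    // Allocation comes from the device heap (requires compute capability >= 2.0)
    int *a = new int[10];

    for (int i = 0; i < 10; i++)
        a[i] = i;

    // Heap allocations are not freed automatically at kernel exit
    delete[] a;
}
```

The device heap has a fixed size (configurable via cudaDeviceSetLimit with cudaLimitMallocHeapSize), so per-thread allocations like this can fail at run time and should be checked in real code.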

On older hardware, and/or with pre-CUDA 4.0 toolkits, the standard approach is to use the cudaMemcpyToSymbol API in host code:

__device__ float *a;

int main()
{
    const size_t sz = 10 * sizeof(float);

    float *ad;
    cudaMalloc((void **)&ad, sz);

    // Pass the symbol itself; string symbol names were deprecated
    // and later removed from the CUDA runtime API
    cudaMemcpyToSymbol(a, &ad, sizeof(float *), size_t(0), cudaMemcpyHostToDevice);

    return 0;
}

This copies a dynamically allocated device pointer onto a symbol that can then be used directly in device code.
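To also cover the last part of the question (copying a host array to the device, modifying it there, and copying it back), here is a hedged sketch building on the same __device__ pointer symbol; the scale kernel and its factor argument are illustrative, and error checking is omitted for brevity:

```
__device__ float *a;

__global__ void scale(float factor)
{
    // Modify the array through the __device__ symbol
    a[threadIdx.x] *= factor;
}

int main()
{
    const int n = 10;
    const size_t sz = n * sizeof(float);

    // Host array, as in the question's float *arr = new float[sizeOfTheArray]
    float *arr = new float[n];
    for (int i = 0; i < n; i++)
        arr[i] = float(i);

    // Allocate device storage and point the symbol at it
    float *ad;
    cudaMalloc((void **)&ad, sz);
    cudaMemcpyToSymbol(a, &ad, sizeof(float *));

    // Host -> device, modify on the device, then device -> host
    cudaMemcpy(ad, arr, sz, cudaMemcpyHostToDevice);
    scale<<<1, n>>>(2.0f);
    cudaMemcpy(arr, ad, sz, cudaMemcpyDeviceToHost);

    cudaFree(ad);
    delete[] arr;
    return 0;
}
```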


EDIT: Answering this question is a bit like hitting a moving target. For the constant memory case you now seem interested in, here is a complete working example:

#include <cstdio>

#define nn (10)

__constant__ float a[nn];

__global__ void kernel(float *out)
{
    if (threadIdx.x < nn)
        out[threadIdx.x] = a[threadIdx.x];

}

int main()
{
    const size_t sz = size_t(nn) * sizeof(float);
    const float avals[nn]={ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10. };
    float ah[nn];

    cudaMemcpyToSymbol(a, &avals[0], sz, size_t(0), cudaMemcpyHostToDevice);

    float *ad;
    cudaMalloc((void **)&ad, sz);

    kernel<<<dim3(1),dim3(16)>>>(ad);

    cudaMemcpy(&ah[0],ad,sz,cudaMemcpyDeviceToHost);

    for(int i=0; i<nn; i++) {
        printf("%d %f\n", i, ah[i]);
    }
}

This shows copying data onto a constant memory symbol, and using that data inside a kernel.
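Since the question also asks about writing from device code: __constant__ memory is read-only inside kernels, so a globally visible array that kernels can write to needs __device__ instead. A minimal sketch of the contrast (array names are illustrative):

```
__constant__ float c[10];  // read-only in device code; set via cudaMemcpyToSymbol
__device__   float d[10];  // readable and writable from kernels

__global__ void writable(float *out)
{
    d[threadIdx.x] = 2.0f * c[threadIdx.x];  // writes are legal only to d
    out[threadIdx.x] = d[threadIdx.x];
}
```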

talonmies answered Sep 21 '22 20:09