Implementing CUDA VecAdd from sample code

I'm trying out the sample code from the CUDA C Programming Guide: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels.

I simply want to add two arrays A and B of size 4 and store the result in array C. Here is what I have so far:

#include <stdio.h>
#include "util.h"

void print_array(int* array, int size) {
    int i;
    for (i = 0; i < size; i++) {
        printf("%d ", array[i]);
    }
    printf("\n");
}

__global__ void VecAdd(int* A, int* B, int* C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main(int argc, char **argv) {
    int N = 4;
    int i;
    int *A = (int *) malloc(N * sizeof(int));
    int *B = (int *) malloc(N * sizeof(int));
    int *C = (int *) malloc(N * sizeof(int));

    for (i = 0; i < N; i++) {
        A[i] = i + 1;
        B[i] = i + 1;
    }

    print_array(A, N);
    print_array(B, N);

    VecAdd<<<1, N>>>(A, B, C);
    print_array(C, N);
    return 0;
}

I'm expecting the C array (the last row of the output) to be 2, 4, 6, 8, but the arrays don't seem to get added:

1 2 3 4
1 2 3 4
0 0 0 0

What am I missing?

How do you add vectors in CUDA?

CUDA code for vector addition:

1. Allocate memory on the host for the data arrays.
2. Initialize the data arrays in the host's memory.
3. Allocate separate memory on the GPU device for the data arrays.
4. Copy the data arrays from the host memory to the GPU device memory.

What is CUDA device synchronize?

Before we can use CUDA streams, we need to understand the notion of device synchronization. This is an operation where the host blocks any further execution until all operations issued to the GPU (memory transfers and kernel executions) have completed.
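As a minimal sketch of that blocking behaviour: a kernel launch returns control to the host right away, and cudaDeviceSynchronize() makes the host wait until the queued GPU work has finished. The names fill_kernel and fill_and_wait are hypothetical and only serve the illustration, and the block size of 256 is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: writes `value` into out[0..n).
__global__ void fill_kernel(int *out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Launches the kernel, then blocks the host until the GPU is done.
// `d_out` must point to device memory holding at least n ints, n > 0.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t fill_and_wait(int *d_out, int n, int value) {
    int threads = 256;                          // assumption: arbitrary block size
    int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
    fill_kernel<<<blocks, threads>>>(d_out, n, value);  // returns immediately
    cudaError_t err = cudaGetLastError();       // catches launch failures
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();             // host waits here for the GPU
}
```

A blocking cudaMemcpy issued after the kernel on the default stream also waits for the kernel to finish, which is why the copy-back step in the answers needs no explicit synchronization call.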

What is the function of the __global__ qualifier in a CUDA program?

__global__ is a qualifier added to standard C. It tells the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU), while still being callable from host code.
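To make the distinction concrete, here is a small sketch that puts a __global__ kernel next to a __device__ helper and an ordinary host function. All three function names (add_pair, add_kernel, launch_add) are hypothetical, not taken from the programming guide.

```cuda
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from GPU code.
__device__ int add_pair(int x, int y) {
    return x + y;
}

// __global__: runs on the GPU but is launched from host code with <<<...>>>.
// A __global__ function must return void.
__global__ void add_kernel(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = add_pair(a[i], b[i]);
}

// No qualifier: an ordinary host function that runs on the CPU.
// The pointers passed in must refer to device memory.
cudaError_t launch_add(const int *d_a, const int *d_b, int *d_c, int n) {
    add_kernel<<<1, n>>>(d_a, d_b, d_c, n);  // one block of n threads
    return cudaGetLastError();               // reports launch failures
}
```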

How do you write a CUDA program to add two arrays by index?

A complete example for adding the elements of two arrays into another array starts by defining the inputs:

const int arraySize = 5;
const int a[arraySize] = { 1, 2, 3, 4, 5 };
const int b[arraySize] = { 10, 20, 30, 40, 50 };

In your example, we want to copy the arrays 'a', 'b' and 'c' from the CPU to the GPU's global memory.

First, define the pointers that will hold the data copied to the GPU:

int a[array_size], b[array_size],c[array_size]; // your original arrays
int *a_cuda,*b_cuda,*c_cuda;                    // defining the "cuda" pointers

Define the size that each array will occupy:

int size = array_size * sizeof(int); // Is the same for the 3 arrays

Then allocate space on the GPU for the data that the kernel will use:

CUDA memory allocation:

msg_erro[0] = cudaMalloc((void **)&a_cuda,size);
msg_erro[1] = cudaMalloc((void **)&b_cuda,size);
msg_erro[2] = cudaMalloc((void **)&c_cuda,size);

Now we need to copy this data from CPU to the GPU:

Copy from CPU to GPU:

msg_erro[3] = cudaMemcpy(a_cuda, a,size,cudaMemcpyHostToDevice);
msg_erro[4] = cudaMemcpy(b_cuda, b,size,cudaMemcpyHostToDevice);
msg_erro[5] = cudaMemcpy(c_cuda, c,size,cudaMemcpyHostToDevice);

Execute the kernel:

int blocks = //;
int threads_per_block = //;
VecAdd<<<blocks, threads_per_block>>>(a_cuda, b_cuda, c_cuda);
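The launch configuration above is left open on purpose. One common way to choose it, sketched here under assumptions: fix the number of threads per block (256 below is an arbitrary choice), round the number of blocks up, and guard against the overhang inside the kernel. VecAddN, blocks_for and launch_vec_add are hypothetical names. The extra parameter n and the blockIdx.x term are needed because the original VecAdd indexes with threadIdx.x only, which works for a single block.

```cuda
#include <cuda_runtime.h>

// Variant of VecAdd (hypothetical) that works across several blocks.
__global__ void VecAddN(const int *A, const int *B, int *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // last block may overhang
        C[i] = A[i] + B[i];
}

// Rounds up so that blocks * threads_per_block >= n.
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// All three pointers must be device pointers, e.g. a_cuda, b_cuda, c_cuda.
cudaError_t launch_vec_add(const int *a_cuda, const int *b_cuda,
                           int *c_cuda, int n) {
    int threads_per_block = 256;  // assumption: a common default, not a rule
    int blocks = blocks_for(n, threads_per_block);
    VecAddN<<<blocks, threads_per_block>>>(a_cuda, b_cuda, c_cuda, n);
    return cudaGetLastError();    // reports an invalid launch configuration
}
```

For the 4-element arrays in the question this gives a single block; the launch <<<1, N>>> from the question works just as well at that size.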

Copy the results from GPU to CPU (in our example array C):

msg_erro[6] = cudaMemcpy(c,c_cuda,size,cudaMemcpyDeviceToHost);

Free Memory:

cudaFree(a_cuda);
cudaFree(b_cuda);
cudaFree(c_cuda);

For debugging purposes, I normally save the status of each function call in an array, like this:

cudaError_t msg_erro[var];

This is not strictly necessary, but it will save you time if an error occurs during allocation or memory transfer. You can take out all the 'msg_erro[x] =' from the code above if you wish.

If you keep the 'msg_erro[x] =' and an error does occur, you can use a function like the following to print the errors:

void printErros(cudaError_t *erros, int size)
{
    for (int i = 0; i < size; i++)
        printf("{%d} => %s\n", i, cudaGetErrorString(erros[i]));
}
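An alternative to collecting the statuses in an array is to check each call where it happens. The sketch below assumes a hypothetical helper cuda_ok and a hypothetical macro CUDA_OK (neither is part of the CUDA API); it uses only cudaGetErrorString from the runtime.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 on success; otherwise prints where the
// call failed together with the CUDA error string, and returns 0.
static int cuda_ok(cudaError_t err, const char *what,
                   const char *file, int line) {
    if (err == cudaSuccess) return 1;
    fprintf(stderr, "%s:%d: %s failed: %s\n",
            file, line, what, cudaGetErrorString(err));
    return 0;
}

// Hypothetical macro: captures the call text and its source location.
#define CUDA_OK(call) cuda_ok((call), #call, __FILE__, __LINE__)
```

A call would then look like: if (!CUDA_OK(cudaMalloc((void **)&a_cuda, size))) return 1; A kernel launch does not return a status itself, so after VecAdd<<<...>>>(...) you can check CUDA_OK(cudaGetLastError()) instead.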

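The kernel is being handed pointers returned by malloc on the host, which it generally cannot dereference. You need to allocate memory on the GPU and transfer the data to and from it, something like: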

    int *a_GPU, *b_GPU, *c_GPU;
        
    cudaMalloc(&a_GPU, N*sizeof(int));
    cudaMalloc(&b_GPU, N*sizeof(int));
    cudaMalloc(&c_GPU, N*sizeof(int));
        
    cudaMemcpy(a_GPU, A, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(b_GPU, B, N*sizeof(int), cudaMemcpyHostToDevice);

    VecAdd<<<1, N>>>(a_GPU, b_GPU, c_GPU);

    cudaMemcpy(C, c_GPU, N*sizeof(int), cudaMemcpyDeviceToHost);
        
    print_array(C, N);

    cudaFree(a_GPU);
    cudaFree(b_GPU);
    cudaFree(c_GPU);
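If you want the same steps with the return codes checked, they can be wrapped in one host function. This is only a sketch: vec_add_on_gpu is a hypothetical name, and it keeps the single-block launch from the question, so n must be positive and must not exceed the per-block thread limit of the device.

```cuda
#include <cuda_runtime.h>

__global__ void VecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

// Hypothetical wrapper: adds host arrays A and B into host array C on the GPU.
// Returns cudaSuccess, or the first error hit along the way.
cudaError_t vec_add_on_gpu(const int *A, const int *B, int *C, int n) {
    int *a_GPU = nullptr, *b_GPU = nullptr, *c_GPU = nullptr;
    size_t bytes = n * sizeof(int);

    // Allocate device buffers; later steps are skipped once one fails.
    cudaError_t err = cudaMalloc(&a_GPU, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&b_GPU, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&c_GPU, bytes);

    // Copy the inputs to the device.
    if (err == cudaSuccess)
        err = cudaMemcpy(a_GPU, A, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(b_GPU, B, bytes, cudaMemcpyHostToDevice);

    // Launch one block of n threads and check that the launch was accepted.
    if (err == cudaSuccess) {
        VecAdd<<<1, n>>>(a_GPU, b_GPU, c_GPU);
        err = cudaGetLastError();
    }

    // This copy waits for the kernel to finish before reading the result.
    if (err == cudaSuccess)
        err = cudaMemcpy(C, c_GPU, bytes, cudaMemcpyDeviceToHost);

    // cudaFree on a null pointer does nothing, so this is safe on any path.
    cudaFree(a_GPU);
    cudaFree(b_GPU);
    cudaFree(c_GPU);
    return err;
}
```

In the question's main, the launch line would then become a call such as vec_add_on_gpu(A, B, C, N), followed by print_array(C, N).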

Tags: arrays, c, parallel-processing, cuda, gpu

badjr

2 Answers

dreamcrash

WildCrustacean
