I wrote my sample code like this:
int ** d_ptr;
cudaMalloc( (void**)&d_ptr, sizeof(int*)*N );
int* tmp_ptr[N];
for(int i=0; i<N; i++)
cudaMalloc( (void**)&tmp_ptr[i], sizeof(int)*SIZE );
cudaMemcpy(d_ptr, tmp_ptr, sizeof(tmp_ptr), cudaMemcpyHostToDevice);
This code runs fine, but after launching the kernel I can't retrieve the result:
int* Mtx_on_GPU[N];
cudaMemcpy(Mtx_on_GPU, d_ptr, sizeof(int)*N*SIZE, cudaMemcpyDeviceToHost);
At this point a segmentation fault occurs, but I don't know what I'm doing wrong.
int* Mtx_on_GPU[N];
for(int i=0; i<N; i++)
cudaMemcpy(Mtx_on_GPU[i], d_ptr[i], sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
This code produces the same error.
I'm sure my code has a mistake somewhere, but I've been unable to find it all day.
Could you give me some advice?
In the last line
cudaMemcpy(Mtx_on_GPU[i], d_ptr[i], sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
you are trying to copy data from the device to the host (NOTE: I assume that you allocated host memory for the Mtx_on_GPU pointers!).
However, the pointers d_ptr[i] are stored in device memory, so you can't dereference them directly from the host side. The line should be
cudaMemcpy(Mtx_on_GPU[i], tmp_ptr[i], sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
This may become clearer when using "overly elaborate" variable names:
int ** devicePointersStoredInDeviceMemory;
cudaMalloc( (void**)&devicePointersStoredInDeviceMemory, sizeof(int*)*N);
int* devicePointersStoredInHostMemory[N];
for(int i=0; i<N; i++)
cudaMalloc( (void**)&devicePointersStoredInHostMemory[i], sizeof(int)*SIZE );
cudaMemcpy(
devicePointersStoredInDeviceMemory,
devicePointersStoredInHostMemory,
sizeof(int*)*N, cudaMemcpyHostToDevice);
// Invoke kernel here, passing "devicePointersStoredInDeviceMemory"
// as an argument
...
int* hostPointersStoredInHostMemory[N];
for(int i=0; i<N; i++) {
int* hostPointer = hostPointersStoredInHostMemory[i];
// (allocate memory for hostPointer here!)
int* devicePointer = devicePointersStoredInHostMemory[i];
cudaMemcpy(hostPointer, devicePointer, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
}
EDIT in response to the comment:
The d_ptr is "an array of pointers", but the memory for this array was allocated with cudaMalloc. That means it is located on the device. In contrast, with int* Mtx_on_GPU[N]; you are "allocating" N pointers in host memory; instead of specifying the array size, you could also have used malloc. It may become clearer when you compare the following allocations:
int** pointersStoredInDeviceMemory;
cudaMalloc((void**)&pointersStoredInDeviceMemory, sizeof(int*)*N);
int** pointersStoredInHostMemory;
pointersStoredInHostMemory = (int**)malloc(N * sizeof(int*));
// This is not possible, because the array was allocated with cudaMalloc:
int *pointerA = pointersStoredInDeviceMemory[0];
// This is possible because the array was allocated with malloc:
int *pointerB = pointersStoredInHostMemory[0];
It may be a little brain-twisting to keep track of, but fortunately it hardly ever goes beyond two levels of indirection.