I started learning CUDA a while ago and I have run into the following problem.
Here is what I am doing:
Copy to the GPU:
int* B;
// ...
int *dev_B;
// B has been allocated and initialized to 0 on the host
cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);
//...
// Execute on the GPU the following kernel, which is supposed to fill
// the dev_B matrix with integers
findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);
Copy back to the CPU:
cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
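As an aside, each of these runtime calls returns a cudaError_t that is worth checking while debugging; a minimal sketch (an illustrative addition, not part of the original code):

// Illustrative error check (assumes the same names as above; needs <cstdio>).
cudaError_t err = cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}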
The findNeiborElem function involves a loop for each thread; it looks something like this:
__global__ void findNeiborElem(int *dev_B, int *dev_MSH, int *dev_Nel, int *dev_Npel, int *dev_Nface, int *dev_FC){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < dev_Nel[0]){
        for (int j = 1; j <= dev_Nel[0]; j++){
            // do some calculations
            dev_B[ind(tid, 1, dev_Nel[0])] = j; // j in most cases does not go all the way up to Nel
            break;
        }
        tid += blockDim.x * gridDim.x;
    }
}
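As an aside, the while loop over tid is the standard grid-stride pattern, so the kernel covers all Nel elements for any launch configuration; a typical choice (the block size 256 is an assumption, not from the original post) would be:

int Nthreads = 256;                            // assumed block size
int Nblocks = (Nel + Nthreads - 1)/Nthreads;   // round up so all Nel rows are covered
findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);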
What's very weird about it is that the time to copy dev_B back to B is proportional to the number of iterations of the j index. For example, if Nel=5 the copy takes approximately 5 seconds, and when I increase it to Nel=20 the copy takes about 20 seconds.

I would expect the copy time to be independent of the number of inner iterations needed to assign the values of the matrix dev_B. I would also expect the time to copy the same matrix to and from the GPU to be of the same order.
Do you have any idea what is wrong?
Instead of using clock() to measure time, you should use events:
With events you would have something like this:
cudaEvent_t start, stop;   // variables that hold the two events
float time;                // variable that will hold the elapsed time

cudaEventCreate(&start);   // create event 1
cudaEventCreate(&stop);    // create event 2

cudaEventRecord(start, 0); // start measuring time

// What you want to measure
cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);

cudaEventRecord(stop, 0);   // stop measuring time
cudaEventSynchronize(stop); // wait until the completion of all device
                            // work preceding the most recent call to cudaEventRecord()
cudaEventElapsedTime(&time, start, stop); // save the measured time (in milliseconds)
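Putting it together, here is a self-contained sketch of the same timing pattern (the array size N, the cleanup calls, and the printf are my additions for illustration):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;                    // placeholder size, an assumption
    int *B = (int*)calloc(N, sizeof(int));    // host buffer, zero-initialized
    int *dev_B = NULL;

    cudaEvent_t start, stop;
    float time = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                // start measuring time
    cudaMalloc((void**)&dev_B, N*sizeof(int));
    cudaMemcpy(dev_B, B, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);                 // stop measuring time
    cudaEventSynchronize(stop);               // wait for the recorded work to finish
    cudaEventElapsedTime(&time, start, stop); // elapsed time in milliseconds

    printf("alloc + host-to-device copy: %f ms\n", time);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev_B);
    free(B);
    return 0;
}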
EDIT: Additional information:
"The kernel launch returns control to the CPU thread before it is finished. Therefore your timing construct is measuring both the kernel execution time as well as the 2nd memcpy. When timing the copy after the kernel, your timer code is being executed immediately, but the cudaMemcpy is waiting for the kernel to complete before it starts. This also explains why your timing measurement for the data return seems to vary based on kernel loop iterations. It also explains why the time spent on your kernel function is "negligible"". credits to Robert Crovella