I have multiple kernels that are launched sequentially, like this:
clEnqueueNDRangeKernel(..., kernel1, ...);
clEnqueueNDRangeKernel(..., kernel2, ...);
clEnqueueNDRangeKernel(..., kernel3, ...);
All of the kernels share one global buffer.
Currently I profile each kernel execution separately and sum the results to get the total execution time, by adding this code block after each clEnqueueNDRangeKernel call:
clFinish(cmdQueue);
status = clGetEventProfilingInfo(...,&starttime,...);
clGetEventProfilingInfo(...,&endtime,...);
time_spent = endtime - starttime;
My question is: how can I profile all three kernels together with a single clFinish (for example, by adding one clFinish() after the last kernel launch)?
Yes, I give every clEnqueueNDRangeKernel its own timing event, and I get a large negative number. In detail:
clEnqueueNDRangeKernel(cmdQueue,...,&timing_event1);
clFinish(cmdQueue);
clGetEventProfilingInfo(timing_event1,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime1,NULL);
clGetEventProfilingInfo(timing_event1,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime1,NULL);
time_spent1 = endtime1 - starttime1;
clEnqueueNDRangeKernel(cmdQueue,...,&timing_event2);
clFinish(cmdQueue);
clGetEventProfilingInfo(timing_event2,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime2,NULL);
clGetEventProfilingInfo(timing_event2,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime2,NULL);
time_spent2 = endtime2 - starttime2;
clEnqueueNDRangeKernel(cmdQueue,...,&timing_event3);
clFinish(cmdQueue);
clGetEventProfilingInfo(timing_event3,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime3,NULL);
clGetEventProfilingInfo(timing_event3,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime3,NULL);
time_spent3 = endtime3 - starttime3;
time_spent_all_0 = time_spent1 + time_spent2 + time_spent3;
time_spent_all_1 = endtime3 - starttime1;
If I keep every clFinish, all profiling values are reasonable, but time_spent_all_1 is about twice time_spent_all_0. If I remove every clFinish except the last one, none of the profiling values are reasonable.
Thanks to Eric Bainville, I got the result I wanted: profiling multiple clEnqueueNDRangeKernel calls with a single clFinish. This is the final code I use:
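A likely reason for the gap between the two totals: with a clFinish after every launch, the host blocks, reads the timestamps, and only then enqueues the next kernel, so the device sits idle in between. The sum of per-kernel durations counts only execution time, while "last end minus first start" also counts those idle gaps. The sketch below shows the two calculations side by side in plain C. The `kernel_interval` type and both function names are hypothetical, and the timestamps are assumed to be the nanosecond values returned by clGetEventProfilingInfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One kernel's profiling timestamps in nanoseconds (hypothetical helper type). */
typedef struct {
    uint64_t start_ns; /* value read with CL_PROFILING_COMMAND_START */
    uint64_t end_ns;   /* value read with CL_PROFILING_COMMAND_END */
} kernel_interval;

/* Sum of the individual kernel durations: time spent actually executing. */
uint64_t busy_ns(const kernel_interval *k, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(k[i].end_ns >= k[i].start_ns);
        total += k[i].end_ns - k[i].start_ns;
    }
    return total;
}

/* First start to last end: also includes any idle gaps between kernels. */
uint64_t span_ns(const kernel_interval *k, size_t n)
{
    if (n == 0) {
        return 0;
    }
    assert(k[n - 1].end_ns >= k[0].start_ns);
    return k[n - 1].end_ns - k[0].start_ns;
}
```

With made-up intervals {1000, 1400}, {1900, 2300}, {2800, 3200}, busy_ns gives 1200 and span_ns gives 2200, so a span roughly twice the sum is what you would expect whenever the gaps are about as long as the kernels themselves.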
clEnqueueNDRangeKernel(cmdQueue,...,&timing_event1);
clEnqueueNDRangeKernel(cmdQueue,...,&timing_event2);
clEnqueueNDRangeKernel(cmdQueue,...,&timing_event3);
clFinish(cmdQueue);
clGetEventProfilingInfo(timing_event1,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime,NULL);
clGetEventProfilingInfo(timing_event3,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime,NULL);
time_spent = endtime - starttime;
Each clEnqueueNDRangeKernel call can create its own cl_event: the last argument of the call is a pointer to a cl_event; if this argument is not 0 (NULL), a new event is created.
After a command has completed, the associated event can be queried for its start/end profiling info. The event must be released after use (call clReleaseEvent).
clFinish blocks until all enqueued commands have completed.
You need only one call to clFinish, and then you can query the profiling info for all events.
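Once the timestamps have been read, keep in mind that they are unsigned 64-bit nanosecond counts (cl_ulong). If an end value is smaller than the start value, for instance because it was never filled in, the unsigned subtraction wraps around and can show up as a huge or "negative" number when printed. A minimal guard, assuming a hypothetical helper name `elapsed_ms` and plain `uint64_t` standing in for cl_ulong:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Converts a start/end pair of nanosecond timestamps to milliseconds.
 * Returns 0 on success, or -1 if the pair is inconsistent (end before start),
 * in which case *out_ms is left untouched.
 */
int elapsed_ms(uint64_t start_ns, uint64_t end_ns, double *out_ms)
{
    assert(out_ms != NULL);
    if (end_ns < start_ns) {
        return -1; /* would wrap around in unsigned arithmetic */
    }
    *out_ms = (double)(end_ns - start_ns) / 1.0e6;
    return 0;
}
```

Checking the return value before summing or printing makes a bad pair of timestamps visible right away instead of silently corrupting the total.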