I have multiple kernels, and they are launched sequentially like this:
        clEnqueueNDRangeKernel(..., kernel1, ...);
        clEnqueueNDRangeKernel(..., kernel2, ...);
        clEnqueueNDRangeKernel(..., kernel3, ...);
All of the kernels share one global buffer.
Currently I profile each kernel execution separately and sum the results to get the total execution time, by adding this code block after each clEnqueueNDRangeKernel:
        clFinish(cmdQueue);
        status = clGetEventProfilingInfo(...,&starttime,...);
        clGetEventProfilingInfo(...,&endtime,...);
        time_spent = endtime - starttime;
My question is: how can I profile all three kernels together with a single clFinish (i.e., one clFinish() after the last kernel launch)?
Yes, I give every clEnqueueNDRangeKernel its own timing event, and I get a large negative number. The details:
        clEnqueueNDRangeKernel(cmdQueue,...,&timing_event1);
        clFinish(cmdQueue);
        clGetEventProfilingInfo(timing_event1,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime1,NULL);
        clGetEventProfilingInfo(timing_event1,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime1,NULL);
        time_spent1 = endtime1 - starttime1;
        clEnqueueNDRangeKernel(cmdQueue,...,&timing_event2);
        clFinish(cmdQueue);
        clGetEventProfilingInfo(timing_event2,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime2,NULL);
        clGetEventProfilingInfo(timing_event2,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime2,NULL);
        time_spent2 = endtime2 - starttime2;
        clEnqueueNDRangeKernel(cmdQueue,...,&timing_event3);
        clFinish(cmdQueue);
        clGetEventProfilingInfo(timing_event3,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime3,NULL);
        clGetEventProfilingInfo(timing_event3,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime3,NULL);
        time_spent3 = endtime3 - starttime3;
        time_spent_all_0 = time_spent1 + time_spent2 + time_spent3;
        time_spent_all_1 = endtime3 - starttime1;
With a clFinish after every kernel, all the individual profiling values are reasonable, but time_spent_all_1 is about twice time_spent_all_0. If I remove every clFinish except the last one, none of the profiling values are reasonable.
Thanks to Eric Bainville, I got the result I wanted: profiling multiple clEnqueueNDRangeKernel calls with a single clFinish. This is the final code I use:
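The two totals measure different things. time_spent_all_0 adds up only the time each kernel spent executing, while time_spent_all_1 runs from the first kernel's start to the last kernel's end, so on an in-order queue it also contains any idle time between kernels. With a clFinish after every launch, that idle time likely includes the host waiting for clFinish to return and then enqueuing the next kernel. A minimal sketch of the arithmetic, assuming the three kernels run one after another without overlapping; plain uint64_t stands in for cl_ulong, and the helper names and timestamps are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: start[i] and end[i] hold the CL_PROFILING_COMMAND_START
   and CL_PROFILING_COMMAND_END values (nanoseconds) of kernel i, in launch order. */

/* Busy time: sum of each kernel's own duration (like time_spent_all_0). */
uint64_t sum_of_durations(const uint64_t *start, const uint64_t *end, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += end[i] - start[i];
    return total;
}

/* Wall span: first kernel's start to last kernel's end (like time_spent_all_1). */
uint64_t wall_span(const uint64_t *start, const uint64_t *end, size_t n)
{
    return n == 0 ? 0 : end[n - 1] - start[0];
}

/* Idle time between kernels: span minus busy time.
   Assumes an in-order queue where kernels do not overlap. */
uint64_t idle_gaps(const uint64_t *start, const uint64_t *end, size_t n)
{
    return wall_span(start, end, n) - sum_of_durations(start, end, n);
}
```

For example, with made-up timestamps start = {1000, 2500, 4200} and end = {1500, 3000, 4700}, each kernel takes 500 ns (1500 ns in total), but the span is 3700 ns because 1000 ns + 1200 ns pass between the kernels. A span of roughly twice the summed durations therefore means the device sat idle between kernels for about as long as it spent running them.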
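        clEnqueueNDRangeKernel(cmdQueue,...,&timing_event1);
        clEnqueueNDRangeKernel(cmdQueue,...,&timing_event2);
        clEnqueueNDRangeKernel(cmdQueue,...,&timing_event3);
        clFinish(cmdQueue); /* waits until all three kernels have completed */
        clGetEventProfilingInfo(timing_event1,CL_PROFILING_COMMAND_START,sizeof(cl_ulong),&starttime,NULL);
        clGetEventProfilingInfo(timing_event3,CL_PROFILING_COMMAND_END,sizeof(cl_ulong),&endtime,NULL);
        time_spent = endtime - starttime; /* nanoseconds, first kernel start to last kernel end */
        clReleaseEvent(timing_event1); /* each event must be released after use */
        clReleaseEvent(timing_event2);
        clReleaseEvent(timing_event3);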
Each clEnqueueNDRangeKernel call creates its own cl_event: the last argument of the call is a pointer to a cl_event, and if this argument is not NULL, a new event is created and returned through it.
After a command has completed, its associated event can be queried for the start/end profiling info. The event must be released after use (call clReleaseEvent).
clFinish blocks until all enqueued commands are completed.
You need only one call to clFinish, and then you can query profiling info for all events.