The first cudaMalloc call is slow (about 0.2 s), apparently because of some initialization work on the GPU. Is there any function that solely does the initialization, so that I can separate out that time? cudaSetDevice seems to reduce the time to 0.15 s, but it still does not eliminate all the init overhead.
cudaMalloc is a function that can be called from the host or the device to allocate memory on the device, much like malloc does for the host. Memory allocated with cudaMalloc must be freed with cudaFree.
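For reference, a minimal host-side sketch of that pairing; the buffer name and the 1 MiB size here are arbitrary:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // Allocate 1 MiB of device memory, then release it. The error code is
    // checked because the first CUDA call can also surface init failures.
    float *d_buf = NULL;
    cudaError_t err = cudaMalloc((void **)&d_buf, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaFree(d_buf);  // every cudaMalloc must be paired with a cudaFree
    return 0;
}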
__global__ : a qualifier added to standard C. It alerts the compiler that a function should be compiled to run on the device (GPU) instead of the host (CPU).
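A small illustrative kernel; the scale function and its launch configuration are made up for the example:

// __global__ marks device code that the host launches with the
// <<<grid, block>>> syntax.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// Launched from the host, e.g.:
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);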
Device code can also allocate and free memory dynamically from a fixed-size heap in global memory. The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory, or NULL if insufficient memory exists to fulfill the request.
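A sketch of what that looks like inside a kernel; the scratch_demo kernel is illustrative:

// Device-side allocation (requires compute capability 2.0+): each thread
// grabs a small scratch buffer from the device heap and must free it
// itself; malloc returns NULL if the heap is exhausted.
__global__ void scratch_demo(int n)
{
    int *tmp = (int *)malloc(n * sizeof(int));
    if (tmp == NULL)
        return;            // heap exhausted; nothing to clean up
    for (int i = 0; i < n; ++i)
        tmp[i] = i;
    free(tmp);             // device malloc pairs with device free
}

The device heap defaults to 8 MB; if a kernel needs more, the host can raise the limit before launch with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes).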
To execute any CUDA program, there are three main steps (a minimal sketch follows the list):

1. Copy the input data from host memory to device memory, also known as the host-to-device transfer.
2. Load the GPU program and execute it, caching data on-chip for performance.
3. Copy the results from device memory back to host memory, also called the device-to-host transfer.
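A minimal program covering all three steps might look like the following; the increment kernel and the 256-element buffer are placeholders:

#include <cuda_runtime.h>
#include <string.h>

__global__ void increment(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1;
}

int main(void)
{
    const int n = 256;
    int h_data[n];
    memset(h_data, 0, sizeof(h_data));

    int *d_data;
    cudaMalloc((void **)&d_data, sizeof(h_data));

    // 1. Host-to-device transfer
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);

    // 2. Load and execute the GPU program
    increment<<<1, n>>>(d_data, n);

    // 3. Device-to-host transfer
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return 0;
}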
A call to
cudaFree(0);
is the canonical way to force lazy context establishment in the CUDA runtime. You can't reduce the overhead itself; that is a function of driver, runtime, and operating system latencies. But the call above lets you control how and when those overheads occur during program execution.
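As a sketch, the timing pattern this implies; now_sec is a hypothetical POSIX timing helper, so this assumes a Linux-like platform:

#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

// Pay the context-establishment cost up front with cudaFree(0), so a
// later cudaMalloc measures only the allocation itself.
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now_sec();
    cudaFree(0);                 // forces lazy context creation
    printf("context init: %.3f s\n", now_sec() - t0);

    void *p;
    t0 = now_sec();
    cudaMalloc(&p, 1 << 20);     // now timed without the init overhead
    printf("cudaMalloc:   %.3f s\n", now_sec() - t0);

    cudaFree(p);
    return 0;
}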
EDIT in 2015 to add that the heuristics of context initialisation in the runtime API have subtly changed over time, so that cudaSetDevice now establishes a context; the cudaFree(0) call isn't explicitly required to initialise a context, and you can use cudaSetDevice instead. Also note that some set-up time will still be incurred at the first kernel launch, whereas before this wasn't the case. For kernel timing, it is best to include a warm-up call before launching the kernel you will time, to remove this set-up latency. The various profiling tools appear to have enough granularity built in to avoid this without any extra API calls or kernel launches.
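A warm-up pattern along those lines, using CUDA events for the measurement; kernel_to_time is a placeholder for the kernel being profiled:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel_to_time(void) { /* work being measured */ }

int main(void)
{
    cudaSetDevice(0);             // establishes the context (post-2015 behaviour)

    kernel_to_time<<<1, 1>>>();   // warm-up launch absorbs the set-up latency
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel_to_time<<<1, 1>>>();   // the launch actually being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}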