I'm trying to learn how to use CUDA with Thrust, and I have seen some code where the printf function appears to be called from the device.
Consider this code:
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <cstdio>

struct functor
{
    __host__ __device__
    void operator()(int val)
    {
        printf("Call for value : %d\n", val);
    }
};

int main()
{
    thrust::host_vector<int> cpu_vec(100);
    for (int i = 0; i < 100; ++i)
        cpu_vec[i] = i;
    thrust::device_vector<int> cuda_vec = cpu_vec; // transfer to GPU
    thrust::for_each(cuda_vec.begin(), cuda_vec.end(), functor());
}
This seems to run fine and prints the message "Call for value : " followed by a number, 100 times.
Now, if I include iostream and replace the printf line with a C++ stream-based equivalent
std::cout << "Call for value : " << val << std::endl;
I get compilation warnings from nvcc, and the compiled program does not print anything:
warning: address of a host variable "std::cout" cannot be directly taken in a device function
warning: calling a __host__ function from a __host__ __device__ function is not allowed
warning: calling a __host__ function("std::basic_ostream<char, std::char_traits<char> >::operator <<") from a __host__ __device__ function("functor::operator ()") is not allowed
- Why does it work with printf?
Because NVIDIA added runtime support for in-kernel printf for all hardware that supports the device ABI (compute capability >= 2.0). There is a template overload of the host printf in device code which provides (almost) standard C-style printf functionality. You must include cstdio or stdio.h in your device code for this mechanism to work.
- Why does it not work with cout?
Because NVIDIA haven't implemented any form of C++ iostream style I/O support within the CUDA device runtime.
- What is actually run on GPU?
The device runtime maintains a FIFO buffer for kernel code to write to via printf calls during kernel execution. The device buffer is copied by the CUDA driver and echoed to stdout at the end of kernel execution. The exact heuristics and mechanism are not documented, but I would assume that format strings and output are stored to the FIFO buffer, then parsed by the host-side driver, and then printed via some sort of callback from the kernel launch API. The runtime API provides a function for controlling the size of the printf FIFO (cudaDeviceSetLimit with the cudaLimitPrintfFifoSize limit).
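As a rough sketch of how the FIFO size could be adjusted: the runtime call involved is cudaDeviceSetLimit with cudaLimitPrintfFifoSize. The wrapper name set_printf_fifo_size is hypothetical, and the __CUDACC__ guard is an assumption added so the file still compiles without nvcc, in which case the helper does nothing and reports failure.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical helper (not part of CUDA): ask the runtime for a printf FIFO
// of `bytes` bytes. Returns true only if the runtime accepted the new limit.
bool set_printf_fifo_size(std::size_t bytes)
{
#ifdef __CUDACC__
    // As far as I understand the runtime documentation, this should be
    // called before launching any kernel that uses printf.
    return cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes) == cudaSuccess;
#else
    (void)bytes;  // host-only build: no CUDA runtime, nothing to configure
    return false;
#endif
}
```

In the program from the question, a call such as set_printf_fifo_size(8 * 1024 * 1024) near the start of main would request an 8 MB buffer. That may help when a kernel prints more than the default buffer can hold.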