Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use thrust min_element algorithm without memcpys between device and host

Tags:

cuda

thrust

I am optimising a pycuda / thrust program. In it, I use thrust::min_element to identify the index of the minimum element in an array that is on the device.

Using Nvidia's visual profiler, it appears that whenever I call thrust::min_element, there is a DtoH (device to host) memcpy. What I would like is for everything to be conducted only on the device. In other words, the output of min_element() should be stored on the device, where I can use it later, without suffering the cost of the small DtoH memcpy. Is there a way to do this? Or am I thinking about things the wrong way?

My attempt to do this is below, where the idea is to place the index of the smallest element in the array pointed at by input_ptr into the first element of the array pointed to by output_ptr. Everything should be done on the device, nothing on the host.

This code produces the right answer, but involving unwanted memcpys. Many thanks in advance for any help you can provide.

#include <thrust/extrema.h>
#include <thrust/device_vector.h>
#include <cuda.h>

void my_min_element(CUdeviceptr input_ptr, int length, CUdeviceptr output_ptr)
{
  thrust::device_ptr<float> i_ptr((float*)input_ptr);
  thrust::device_ptr<int> o_ptr((int*)output_ptr);
  o_ptr[0] = thrust::distance(i_ptr,thrust::min_element(i_ptr, i_ptr+length));
}
like image 414
weemattisnot Avatar asked Nov 10 '22 13:11

weemattisnot


1 Answers

I have found a (disappointing) answer to my own question:

I found this quote from someone on the CUDA development team [link]

"I am not a Thrust expert, so take this feedback with a grain of salt; but I think this design element of Thrust deserves to be revisited. Thrust is expressive and useful in ways that sometimes are undermined by the emphasis on returning results to the host. I've had plenty of occasions where I wanted to do an operation strictly in device memory, so Thrust's predisposition toward returning a value to host memory actually got in the way; and if I want results returned to the host, I can always pass in a mapped device pointer (which, if UVA is in effect, means any host pointer that was allocated by CUDA)"

..so it looks like I may be out of luck. If so, what a design flaw in thrust!

like image 146
weemattisnot Avatar answered Dec 21 '22 08:12

weemattisnot