I have a Python function (named apply_filter) whose execution may run either on the CPU (using NumPy) or on the GPU (using CuPy). The function takes an input-buffer object representing a pointer to data either in system memory or in the GPU's global device memory.
I want to invoke it from C++ code using the Python C API. To do so, I need to construct, on the C++ side, something to pass as the input-buffer object, which in my case corresponds to a raw pointer. But I'm not sure how to do this.
Here is a simplified version of my code:
The invoking code, in C++:
#include <Python.h>
#include <array>

void PythonObjectWrapper::applyFilter(float* image, std::array<int, 3> dim) {
    PyObject* python_method = PyObject_GetAttrString(class_object_, method_name_);
    PyObject* py_image = ???  // convert C-array to PyObject
    PyObject* method_args = PyTuple_New(2);
    PyTuple_SetItem(method_args, 0, py_image);
    PyTuple_SetItem(method_args, 1, ...);  // transfer dim
    PyObject* py_filtered_image = PyObject_CallObject(python_method, method_args);
    float* filtered_image = ???  // convert PyObject to C-array
}
The invoked function, in Python:
class Filter:
    def __init__(self, gpu):
        self.gpu_ = gpu

    def apply_filter(self, image_ptr, dim):
        image_array = ???  # convert image_ptr PyObject to NumPy / CuPy array
        apply_filter_(image_array)
        filtered_image_ptr = ???  # convert image_array to ptr
        return filtered_image_ptr
How do I complete the four lines marked with ??? above?
Bonus points for a solution that avoids unnecessary copies (especially between host and device, in either direction), is efficient, and robustly supports both run modes (CPU/GPU).
This solution may not be optimal or the most efficient, but it does work:
Each of the four ??? placeholders in your code needs its own handling. Let's go over them in order:
Convert C-ptr to PyObject on Host
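A convenient way to do so is PyByteArray. Note that PyByteArray_FromStringAndSize copies the data into a new, Python-owned bytearray:
PyObject* py_image = PyByteArray_FromStringAndSize(
    reinterpret_cast<char*>(image),
    sizeof(float) * dim[0] * dim[1] * dim[2]);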
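Since the question asks about avoiding copies, the following minimal sketch shows the copy semantics of PyByteArray_FromStringAndSize from within Python, by calling the same C API function through ctypes. It also shows PyMemoryView_FromMemory, a different C API function that is not part of the original answer, which wraps host memory without copying. The `source` buffer and all variable names are hypothetical stand-ins for the C++ `float* image`.

```python
import ctypes
import struct

api = ctypes.pythonapi
api.PyByteArray_FromStringAndSize.restype = ctypes.py_object
api.PyByteArray_FromStringAndSize.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t]
api.PyMemoryView_FromMemory.restype = ctypes.py_object
api.PyMemoryView_FromMemory.argtypes = [
    ctypes.c_void_p, ctypes.c_ssize_t, ctypes.c_int]

PyBUF_WRITE = 0x200  # value of the PyBUF_WRITE macro in CPython's headers

# Hypothetical stand-in for the C++ float* image buffer.
source = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
address, nbytes = ctypes.addressof(source), ctypes.sizeof(source)

# The bytearray receives its own copy of the bytes ...
copied = api.PyByteArray_FromStringAndSize(address, nbytes)
# ... while the memoryview refers to the original memory (no copy).
shared = api.PyMemoryView_FromMemory(address, nbytes, PyBUF_WRITE)

source[0] = 42.0  # mutate the "C++" buffer after wrapping it

first_in_copy = struct.unpack_from("f", copied, 0)[0]
first_in_view = struct.unpack_from("f", shared, 0)[0]
print(first_in_copy, first_in_view)
```

With the memoryview variant, the wrapped memory must stay valid for as long as Python code can reach the view, so ownership stays a concern of the C++ side.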
Convert C-ptr to PyObject on Device
In this case PyByteArray won't work: it copies from a host-readable address, and a device pointer cannot be dereferenced on the host. A convenient way to pass a raw pointer as a PyObject without touching the memory it points to is PyCapsule, which can be created as follows:
PyObject* py_image = PyCapsule_New(reinterpret_cast<void*>(image), "image", NULL);
No destructor is needed here (hence the NULL), since the C++ code owns the allocated device memory. The name ("image" here) must be the same string that is later passed to PyCapsule_GetPointer.
Convert PyObject to NumPy Array
The PyByteArray holds contiguous memory on the host, and can thus be read as a simple buffer by NumPy (dims below is the dim argument of apply_filter):
image_buffer = np.frombuffer(
    image_ptr,
    dtype=np.float32,
    count=dims[0] * dims[1] * dims[2])
image_array = (
    np.asarray(image_buffer, dtype=np.float32)
    .reshape(dims[2], dims[1], dims[0])
    .transpose(1, 2, 0))
NumPy, like C++, defaults to C-order (row-major) layout, so the reshape reads the flat buffer with dims[0] as the fastest-varying axis. The transpose then only reorders the axes, giving an array of shape (dims[1], dims[0], dims[2]), presumably the indexing the filter expects. Both operations return views, so no data is copied.
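To see what the reshape and transpose do to the indexing, here is a small sketch with hypothetical dimensions and a buffer filled with the values 0, 1, 2, ... It checks which flat offset an element of the transposed array comes from, and that the result is still a view on the original buffer.

```python
import struct
import numpy as np

dims = (2, 3, 4)  # hypothetical (dim[0], dim[1], dim[2])
count = dims[0] * dims[1] * dims[2]

# Stand-in for the bytearray received from C++: floats 0.0, 1.0, 2.0, ...
raw = bytearray(struct.pack(f"{count}f", *map(float, range(count))))

image_buffer = np.frombuffer(raw, dtype=np.float32, count=count)
image_array = image_buffer.reshape(dims[2], dims[1], dims[0]).transpose(1, 2, 0)

# Element [j, i, k] of the transposed array is the flat element at
# (k * dims[1] + j) * dims[0] + i, i.e. dims[0] varies fastest in memory.
i, j, k = 1, 2, 3
flat_index = (k * dims[1] + j) * dims[0] + i
value = image_array[j, i, k]

# Neither reshape nor transpose copied anything.
shares = np.shares_memory(image_array, image_buffer)
print(image_array.shape, value, flat_index, shares)
```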
Convert PyObject to CuPy Array
This is probably the trickiest one. You need to call the Python C API directly (through ctypes.pythonapi) to unpack the pointer, and then use some CuPy utilities to turn it into an array. By default, ctypes assumes that a foreign function returns a C int, so without further declarations PyCapsule_GetPointer would hand back a truncated address on 64-bit platforms; that is why its restype and argtypes have to be declared manually.
First, we need to open the PyCapsule to obtain the raw pointer on the device:
ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [
    ctypes.py_object, ctypes.c_char_p]
raw_address = ctypes.pythonapi.PyCapsule_GetPointer(
    image_ptr, self.pycapsule_name_.encode('utf-8'))
raw_ptr = ctypes.c_void_p(raw_address)
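The capsule handling can be tried out without a GPU. The sketch below creates a capsule around a host buffer and unpacks it again, entirely through ctypes.pythonapi. The buffer, the capsule name and the variable names are hypothetical, and the destructor parameter is declared as a plain void pointer (a simplification, so that None maps to NULL).

```python
import ctypes

api = ctypes.pythonapi
api.PyCapsule_New.restype = ctypes.py_object
# Simplification: the destructor is declared as void* so that None means NULL.
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
# Without this restype, ctypes would treat the returned pointer as a C int.
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# The capsule keeps the name pointer, so the bytes object must stay alive.
CAPSULE_NAME = b"image"

# Hypothetical host buffer standing in for the pointer owned by C++.
payload = (ctypes.c_float * 3)(1.5, 2.5, 3.5)
address = ctypes.addressof(payload)

capsule = api.PyCapsule_New(address, CAPSULE_NAME, None)
recovered = api.PyCapsule_GetPointer(capsule, CAPSULE_NAME)

# The recovered integer is the original address and can be dereferenced.
values = ctypes.cast(ctypes.c_void_p(recovered), ctypes.POINTER(ctypes.c_float))
print(recovered == address, values[1])
```

The name given to PyCapsule_GetPointer has to match the name the capsule was created with, which is why a single constant is used for both calls.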
Now we define the CuPy array on top of raw_ptr, with the appropriate size:
mem = cp.cuda.MemoryPointer(
    cp.cuda.UnownedMemory(
        raw_ptr.value,
        dims[0] * dims[1] * dims[2] * cp.dtype(cp.float32).itemsize,
        None),
    0)
cupy_array = cp.ndarray(dims, dtype=cp.float32, memptr=mem)
cupy_array = cp.asarray(cupy_array, dtype=cp.float32).reshape(
    dims[2], dims[1], dims[0])
image_array = cp.transpose(cupy_array, axes=(1, 2, 0))
And that's it (for the input...)! Now you can write your filter once, against either the np or the cp module (selected through an appropriate wrapper), and have it run on both CPU and GPU.
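One possible shape for such a wrapper is sketched below: a small helper returns the array module to use, and the filter is written against that module. The helper name `array_module` and the body of `apply_filter_` are hypothetical; the real filter from the question is not shown in the post.

```python
import numpy as np

try:
    import cupy as cp  # optional; only needed for the GPU run mode
except ImportError:
    cp = None


def array_module(use_gpu):
    """Return the module (NumPy or CuPy) that array code should call into."""
    if use_gpu:
        if cp is None:
            raise RuntimeError("GPU mode requested but CuPy is not available")
        return cp
    return np


def apply_filter_(image_array, xp):
    # Hypothetical stand-in filter: scale, then clamp to [0, 1].
    return xp.clip(image_array * 2.0, 0.0, 1.0)


xp = array_module(use_gpu=False)
result = apply_filter_(xp.asarray([0.2, 0.8], dtype=xp.float32), xp)
print(result)
```

In the Filter class from the question, the module would presumably be chosen once from self.gpu_ and reused for every call.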
Oh, you also want to return this array as a raw pointer back to C++? This raises some more complications:
Convert NumPy Array to PyObject
That's easy. Simply take the buffer of a C-contiguous copy:
filtered_image_ptr = image_array.copy(order='C').data
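One thing to watch is the memory layout of that copy. A C-order copy of the transposed array lays the data out in the transposed axis order, which differs from the layout the C++ buffer had on input. If the C++ side expects the original layout back (an assumption; the post does not say), the transpose can be undone first. A sketch with hypothetical dimensions and an identity "filter":

```python
import numpy as np

dims = (2, 3, 4)  # hypothetical (dim[0], dim[1], dim[2])
flat = np.arange(dims[0] * dims[1] * dims[2], dtype=np.float32)
image_array = flat.reshape(dims[2], dims[1], dims[0]).transpose(1, 2, 0)

# As in the answer: a C-contiguous copy in the transposed axis order.
as_is = image_array.copy(order='C').data

# Alternative: undo transpose(1, 2, 0) with transpose(2, 0, 1) first, so the
# bytes come out in the same order in which they arrived from C++.
restored = np.ascontiguousarray(image_array.transpose(2, 0, 1)).data

print(as_is.shape, restored.shape)
print(restored.tobytes() == flat.tobytes())
```

np.ascontiguousarray only copies when the array is not already C-contiguous, whereas copy(order='C') always copies.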
Convert CuPy Array to PyObject
Here you again need to wrap your raw pointer in a PyCapsule, and again you need to declare the restype and argtypes of the Python C API function:
ctypes.pythonapi.PyCapsule_New.restype = ctypes.py_object
PyCapsule_Destructor = ctypes.CFUNCTYPE(None, ctypes.py_object)
ctypes.pythonapi.PyCapsule_New.argtypes = [
    ctypes.c_void_p, ctypes.c_char_p, PyCapsule_Destructor]
image_raw_ptr = ctypes.c_void_p(image_array.data.ptr)
# PyCapsule_New stores the name pointer without copying the string, so the
# bytes object is kept alive on self (pycapsule_name_bytes_ is a name
# introduced here for illustration).
self.pycapsule_name_bytes_ = self.pycapsule_name_.encode('utf-8')
filtered_image_ptr = ctypes.pythonapi.PyCapsule_New(
    image_raw_ptr, self.pycapsule_name_bytes_, PyCapsule_Destructor(0))
Convert PyObject to C-ptr on Host
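You can access the object returned for the NumPy array through the buffer protocol, as a Py_buffer:
Py_buffer buffer;
if (PyObject_GetBuffer(py_filtered_image, &buffer, PyBUF_FORMAT) == 0) {
    memcpy(
        filtered_image,
        buffer.buf, dim[0] * dim[1] * dim[2] * sizeof(float));
    PyBuffer_Release(&buffer);  // every successful PyObject_GetBuffer needs a release
}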
Convert PyObject to C-ptr on Device
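Simply unpack the PyCapsule. No restype/argtypes declarations are needed here, because C++ calls PyCapsule_GetPointer through its real prototype from Python.h rather than through ctypes.
// The name must match the one the capsule was created with on the Python side.
auto* filtered_image_ptr = reinterpret_cast<float*>(
    PyCapsule_GetPointer(py_filtered_image, "slices"));
// The source is device memory. DeviceToDevice assumes filtered_image is a
// device buffer too; use cudaMemcpyDeviceToHost if it is a host buffer.
cudaMemcpy(
    filtered_image,
    filtered_image_ptr, dim[0] * dim[1] * dim[2] * sizeof(float),
    cudaMemcpyDeviceToDevice);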