Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cython: Why do NumPy arrays need to be type-cast to object?

I have seen something like this a few times in the Pandas source:

def nancorr(ndarray[float64_t, ndim=2] mat, bint cov=0, minp=None):
    # ...
    N, K = (<object> mat).shape

This implies that a NumPy ndarray called mat is type-casted to a Python object.*

Upon further inspection, it seems like this is used because a compile error arises if it isn't. My question is: why is this type-cast required in the first place?

Here are a few examples. This answer suggest simply that tuple packing doesn't work in Cython like it does in Python---but it doesn't seem to be a tuple unpacking issue. (It is a fine answer regardless, and I don't mean to pick on it.)

Take the following script, shape.pyx. It will fail at compile time with "Cannot convert 'npy_intp *' to Python object."

from cython cimport Py_ssize_t
import numpy as np
from numpy cimport ndarray, float64_t
cimport numpy as cnp
cnp.import_array()

def test_castobj(ndarray[float64_t, ndim=2] arr):

    cdef:
        Py_ssize_t b1, b2

    # Tuple unpacking - this will fail at compile
    b1, b2 = arr.shape
    return b1, b2

But again, the issue does not seem to be tuple unpacking, per se. This will fail with the same error.

def test_castobj(ndarray[float64_t, ndim=2] arr):

    cdef:
        # Py_ssize_t b1, b2
        ndarray[float64_t, ndim=2] zeros

    zeros = np.zeros(arr.shape, dtype=np.float64)
    return zeros

Seemingly, no tuple unpacking is happening here. A tuple is the first arg to np.zeros.

def test_castobj(ndarray[float64_t, ndim=2] arr):
    """This works"""
    cdef:
        Py_ssize_t b1, b2
        ndarray[float64_t, ndim=2] zeros

    b1, b2 = (<object> arr).shape
    zeros = np.zeros((<object> arr).shape, dtype=np.float64)
    return b1, b2, zeros

This also works (perhaps the most confusing of all):

def test_castobj(object[float64_t, ndim=2] arr):
    cdef:
        tuple shape = arr.shape
        ndarray[float64_t, ndim=2] zeros
    zeros = np.zeros(shape, dtype=np.float64)
    return zeros

Example:

>>> from shape import test_castobj
>>> arr = np.arange(6, dtype=np.float64).reshape(2, 3)

>>> test_castobj(arr)
(2, 3, array([[0., 0., 0.],
        [0., 0., 0.]]))

*Perhaps it has something to do with arr being a memoryview? But that's a shot in the dark.


Another example is in the Cython docs:

cpdef int sum3d(int[:, :, :] arr) nogil:
    cdef size_t i, j, k
    cdef int total = 0
    I = arr.shape[0]
    J = arr.shape[1]
    K = arr.shape[2]

In this case, simply indexing arr.shape[i] prevents the error, which I find strange.

This also works:

def test_castobj(object[float64_t, ndim=2] arr):
    cdef ndarray[float64_t, ndim=2] zeros
    zeros = np.zeros(arr.shape, dtype=np.float64)
    return zeros
like image 850
Brad Solomon Avatar asked Nov 08 '22 03:11

Brad Solomon


1 Answers

You are right, it has nothing to do with the tuple-unpacking under Cython.

The reason is, that cnp.ndarray isn't an usual numpy-array (that means a numpy-array with interface known from python), but a Cython wrapper of the numpy's C-implementation for PyArrayObject (which is known as np.array in Python):

ctypedef class numpy.ndarray [object PyArrayObject]:
    cdef __cythonbufferdefaults__ = {"mode": "strided"}

    cdef:
        # Only taking a few of the most commonly used and stable fields.
        # One should use PyArray_* macros instead to access the C fields.
        char *data
        int ndim "nd"
        npy_intp *shape "dimensions"
        npy_intp *strides
        dtype descr
        PyObject* base

shape maps in reality to dimensions-field (npy_intp *shape "dimensions" instead of simply npy_intp *dimensions) of the underlying C-stuct. It is a trick, so one can write

mat.shape[0]

and it has the looks (and to some degree the feel) as if numpy's python-property shape is called. But in reality a shortcut directly to the underlying C-stuct is taken.

Btw calling python-shape is quite costly: a tuple must be created and filled with values from dimensions, then the 0-th element is accessed. On the other hand, Cython's way of doing it is much cheaper - just access the right element.

However, if you yet want access the python-property of the array, you have to cast it to a normal python-object (i.e. forget that this is a ndarray) and then shape is resolved to the tuple-property call via the usual Python-mechanism.

So basically, even if this is convenient, you don't want to access the dimensions of the numpy array in a tight loop the way it is done in the pandas-code, instead you would do the more verbose variant for performance:

...
N=mat.shape[0]
K=mat.shape[1]
...

Why you can write object[cnp.float64_t] or similar in the function signature strikes me as strange - the parameter is then obviously interpreted as a simple object. Maybe this is just a bug.

like image 166
ead Avatar answered Nov 14 '22 21:11

ead