I have seen something like this a few times in the Pandas source:
def nancorr(ndarray[float64_t, ndim=2] mat, bint cov=0, minp=None):
# ...
N, K = (<object> mat).shape
This implies that a NumPy ndarray
called mat
is type-casted to a Python object.*
Upon further inspection, it seems like this is used because a compile error arises if it isn't. My question is: why is this type-cast required in the first place?
Here are a few examples. This answer suggest simply that tuple packing doesn't work in Cython like it does in Python---but it doesn't seem to be a tuple unpacking issue. (It is a fine answer regardless, and I don't mean to pick on it.)
Take the following script, shape.pyx
. It will fail at compile time with "Cannot convert 'npy_intp *' to Python object."
from cython cimport Py_ssize_t
import numpy as np
from numpy cimport ndarray, float64_t
cimport numpy as cnp
cnp.import_array()
def test_castobj(ndarray[float64_t, ndim=2] arr):
cdef:
Py_ssize_t b1, b2
# Tuple unpacking - this will fail at compile
b1, b2 = arr.shape
return b1, b2
But again, the issue does not seem to be tuple unpacking, per se. This will fail with the same error.
def test_castobj(ndarray[float64_t, ndim=2] arr):
cdef:
# Py_ssize_t b1, b2
ndarray[float64_t, ndim=2] zeros
zeros = np.zeros(arr.shape, dtype=np.float64)
return zeros
Seemingly, no tuple unpacking is happening here. A tuple is the first arg to np.zeros
.
def test_castobj(ndarray[float64_t, ndim=2] arr):
"""This works"""
cdef:
Py_ssize_t b1, b2
ndarray[float64_t, ndim=2] zeros
b1, b2 = (<object> arr).shape
zeros = np.zeros((<object> arr).shape, dtype=np.float64)
return b1, b2, zeros
This also works (perhaps the most confusing of all):
def test_castobj(object[float64_t, ndim=2] arr):
cdef:
tuple shape = arr.shape
ndarray[float64_t, ndim=2] zeros
zeros = np.zeros(shape, dtype=np.float64)
return zeros
Example:
>>> from shape import test_castobj
>>> arr = np.arange(6, dtype=np.float64).reshape(2, 3)
>>> test_castobj(arr)
(2, 3, array([[0., 0., 0.],
[0., 0., 0.]]))
*Perhaps it has something to do with arr
being a memoryview? But that's a shot in the dark.
Another example is in the Cython docs:
cpdef int sum3d(int[:, :, :] arr) nogil:
cdef size_t i, j, k
cdef int total = 0
I = arr.shape[0]
J = arr.shape[1]
K = arr.shape[2]
In this case, simply indexing arr.shape[i]
prevents the error, which I find strange.
This also works:
def test_castobj(object[float64_t, ndim=2] arr):
cdef ndarray[float64_t, ndim=2] zeros
zeros = np.zeros(arr.shape, dtype=np.float64)
return zeros
You are right, it has nothing to do with the tuple-unpacking under Cython.
The reason is, that cnp.ndarray
isn't an usual numpy-array (that means a numpy-array with interface known from python), but a Cython wrapper of the numpy's C-implementation for PyArrayObject
(which is known as np.array
in Python):
ctypedef class numpy.ndarray [object PyArrayObject]:
cdef __cythonbufferdefaults__ = {"mode": "strided"}
cdef:
# Only taking a few of the most commonly used and stable fields.
# One should use PyArray_* macros instead to access the C fields.
char *data
int ndim "nd"
npy_intp *shape "dimensions"
npy_intp *strides
dtype descr
PyObject* base
shape
maps in reality to dimensions
-field (npy_intp *shape "dimensions"
instead of simply npy_intp *dimensions
) of the underlying C-stuct. It is a trick, so one can write
mat.shape[0]
and it has the looks (and to some degree the feel) as if numpy's python-property shape
is called. But in reality a shortcut directly to the underlying C-stuct is taken.
Btw calling python-shape
is quite costly: a tuple must be created and filled with values from dimensions
, then the 0-th element is accessed. On the other hand, Cython's way of doing it is much cheaper - just access the right element.
However, if you yet want access the python-property of the array, you have to cast it to a normal python-object (i.e. forget that this is a ndarray
) and then shape
is resolved to the tuple-property call via the usual Python-mechanism.
So basically, even if this is convenient, you don't want to access the dimensions of the numpy array in a tight loop the way it is done in the pandas-code, instead you would do the more verbose variant for performance:
...
N=mat.shape[0]
K=mat.shape[1]
...
Why you can write object[cnp.float64_t]
or similar in the function signature strikes me as strange - the parameter is then obviously interpreted as a simple object. Maybe this is just a bug.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With