I have several series of lists of variable length with some nulls. One example is:
In [108]: s0 = pd.Series([['a', 'b'],['c'],np.nan])
In [109]: s0
Out[109]:
0 [a, b]
1 [c]
2 NaN
dtype: object
But another contains only NaNs:
In [110]: s1 = pd.Series([np.nan,np.nan])
In [111]: s1
Out[111]:
0 NaN
1 NaN
dtype: float64
I need the last item in each list, which is straightforward:
In [112]: s0.map(lambda x: x[-1] if isinstance(x,list) else x)
Out[112]:
0 b
1 c
2 NaN
dtype: object
But whilst getting to this I discovered that, without the isinstance check, when the indexing chokes on the NaNs it does so differently on s0 and s1:
In [113]: s0.map(lambda x: x[-1])
...
TypeError: 'float' object is not subscriptable
In [114]: s1.map(lambda x: x[-1])
...
IndexError: invalid index to scalar variable.
Can anyone explain why? Is this a bug? I'm using Pandas 0.16.2 and Python 3.4.3.
At its core, this is really a NumPy issue rather than a pandas issue.
map iterates over the values in the column and passes them to the lambda function one at a time. Underneath, columns/Series in pandas are just (slices of) NumPy arrays, so pandas defines the following helper function to get each value out of the underlying array. This is called by map on each iteration:
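PANDAS_INLINE PyObject*
get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
    /* Address of element i: base data pointer plus i strides along axis 0. */
    char *item = (char *) PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
    /* Build the scalar for that element; its type follows the array's dtype. */
    return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*) ap);
}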
The key bit is PyArray_Scalar, which is a NumPy API function that copies a section of a NumPy array out to return a scalar value.
The code that makes up the function is too long to post here, but it can be found in the NumPy codebase. All we need to know is that the scalar it returns will match the dtype of the array it's used on.
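To see that dtype-matching behaviour from Python, here is a minimal sketch. It assumes that plain element indexing (arr[i]) is a fair stand-in for the C-level PyArray_Scalar call, and the array names and the element_type helper are made up for illustration.

```python
import numpy as np

# Hypothetical arrays, one per dtype of interest.
float_arr = np.array([1.5, np.nan])          # dtype float64
int_arr = np.array([1, 2], dtype=np.int32)   # dtype int32
obj_arr = np.empty(2, dtype=object)          # dtype object
obj_arr[0] = ['a', 'b']
obj_arr[1] = float('nan')


def element_type(arr, i):
    """Return the type of the value produced by indexing element i."""
    return type(arr[i])


# Numeric dtypes give NumPy scalar types; object dtype hands back
# whatever Python object was stored in that slot.
for name, arr in [('float64', float_arr), ('int32', int_arr), ('object', obj_arr)]:
    print(name, [element_type(arr, i).__name__ for i in range(len(arr))])
```

The object array is the odd one out: there is no NumPy scalar type to wrap the element in, so the stored Python object comes back unchanged.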
Back to your Series: s0 has object dtype while s1 has float64 dtype. This means that PyArray_Scalar will return a different type of scalar for each Series: an actual Python float object and a NumPy scalar float object respectively:
>>> type(s0[2])
float
>>> type(s1[0])
numpy.float64
The NaN values are returned as two different types, hence the different errors when you try to index into them using the lambda function.
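The two errors can be reproduced without pandas at all. The sketch below indexes a plain Python float and a NumPy float64 scalar directly; the helper name indexing_error is invented for this example and is not part of either library.

```python
import numpy as np


def indexing_error(value):
    """Try value[-1] and return the exception raised, or None if it worked."""
    try:
        value[-1]
    except (TypeError, IndexError) as exc:
        return exc
    return None


py_nan = float('nan')       # what an object-dtype Series hands to the lambda
np_nan = np.float64('nan')  # what a float64-dtype Series hands to the lambda

for v in (py_nan, np_nan):
    err = indexing_error(v)
    print(type(v).__name__, '->', type(err).__name__, ':', err)
```

Either way, the isinstance(x, list) guard from the question avoids the problem, since neither kind of NaN is a list.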