
pandas IndexError/TypeError inconsistency with NaN values

I have several series of lists of variable length with some nulls. One example is:

In [108]: s0 = pd.Series([['a', 'b'],['c'],np.nan])
In [109]: s0
Out[109]: 
0    [a, b]
1       [c]
2       NaN
dtype: object

but another contains all NaNs:

In [110]: s1 = pd.Series([np.nan,np.nan])
In [111]: s1
Out[111]: 
0    NaN
1    NaN
dtype: float64

I need the last item in each list, which is straightforward:

In [112]: s0.map(lambda x: x[-1] if isinstance(x,list) else x)
Out[112]: 
0      b
1      c
2    NaN
dtype: object
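For reuse across several series, the same guard can be pulled out into a small named function. This is only a sketch: `last_item` is a hypothetical name, and returning NaN for an empty list is an assumption that the question does not cover.

```python
import numpy as np
import pandas as pd


def last_item(value):
    """Return the last element of a list; pass any other value through unchanged."""
    if isinstance(value, list):
        # Assumption: an empty list is treated like a missing value.
        return value[-1] if value else np.nan
    return value


s0 = pd.Series([['a', 'b'], ['c'], np.nan])
s1 = pd.Series([np.nan, np.nan])

# The isinstance guard means the NaN entries are never indexed,
# so both the object Series and the float64 Series are handled alike.
last0 = s0.map(last_item)
last1 = s1.map(last_item)
print(last0)
print(last1)
```

Because the NaNs are returned untouched, the function never reaches the indexing step that raises the errors discussed below.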

While getting there, I discovered that without the isinstance check the indexing fails on the NaNs with a different error for s0 than for s1:

In [113]: s0.map(lambda x: x[-1])
...
TypeError: 'float' object is not subscriptable

In [114]: s1.map(lambda x: x[-1])
...
IndexError: invalid index to scalar variable.

Can anyone explain why? Is this a bug? I'm using Pandas 0.16.2 and Python 3.4.3.

asked Nov 08 '22 23:11 by majr


1 Answer

At its core, this is really a NumPy issue rather than a pandas issue.

map iterates over the values in the Series and passes them to the lambda function one at a time. Underneath, columns/Series in pandas are just (slices of) NumPy arrays, so pandas defines the following helper function to fetch each value from the underlying array before handing it to your function. This is called by map on each iteration:

PANDAS_INLINE PyObject*
get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
  char *item = (char *) PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
  return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*) ap);
}

The key bit is PyArray_Scalar, which is a NumPy API function that copies a section of a NumPy array out to return a scalar value.

The code that makes up the function is too long to post here, but it can be found in the NumPy codebase. All we need to know is that the scalar it returns will match the dtype of the array it's used on.

Back to your Series: s0 has object dtype while s1 has float64 dtype. This means that PyArray_Scalar will return a different type of scalar for each Series. For s0 it is an actual Python float object; for s1 it is a NumPy float64 scalar:

>>> type(s0[2])
float
>>> type(s1[0])
numpy.float64

The NaN values are returned as two different types, hence the different errors when you try to index into them using the lambda function.
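The difference can be reproduced without pandas at all. The snippet below is a minimal sketch using plain NumPy arrays in place of the two Series; `indexing_error` is a hypothetical helper that only reports which exception the indexing raises.

```python
import numpy as np

# Object-dtype array: elements are stored as the original Python objects.
obj_arr = np.empty(2, dtype=object)
obj_arr[0] = ['a', 'b']
obj_arr[1] = float('nan')

# float64 array: elements come back as NumPy scalars when indexed.
flt_arr = np.array([np.nan, np.nan])


def indexing_error(value):
    """Return the name of the exception raised by value[-1], or None if it succeeds."""
    try:
        value[-1]
    except (TypeError, IndexError) as exc:
        return type(exc).__name__
    return None


print(type(obj_arr[1]).__name__, indexing_error(obj_arr[1]))  # float TypeError
print(type(flt_arr[0]).__name__, indexing_error(flt_arr[0]))  # float64 IndexError
```

A plain Python float has no indexing support at all, hence the TypeError. NumPy scalars do accept the indexing syntax but reject an integer index, hence the IndexError.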

answered Nov 14 '22 21:11 by Alex Riley