
Why is .loc slicing in pandas inclusive of stop, contrary to typical python slicing?

Tags: python, pandas

I am slicing a pandas dataframe and I seem to be getting unexpected slices using .loc, at least as compared to numpy and ordinary python slicing. See the example below.

>>> import pandas as pd
>>> a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
>>> a
    0   1   2
0   0   1   2
1   3   4   5
2   4   5   6
3   9  10  11
4  34   2   1
>>> a.loc[1:3, :]
   0   1   2
1  3   4   5
2  4   5   6
3  9  10  11
>>> a.values[1:3, :]
array([[3, 4, 5],
       [4, 5, 6]])

Interestingly, this only happens with .loc, not .iloc.

>>> a.iloc[1:3, :]
   0  1  2
1  3  4  5
2  4  5  6

Thus, .loc appears to be inclusive of the terminating index, but numpy and .iloc are not.

Judging by the comments, this is not a bug, and the documentation does warn about it. But why is it the case?

jtorca asked Mar 15 '19 17:03


2 Answers

Remember .loc is primarily label based indexing. The decision to include the stop endpoint becomes far more obvious when working with a non-RangeIndex:

df = pd.DataFrame([1,2,3,4], index=list('achz'))
#   0
#a  1
#c  2
#h  3
#z  4

If I want to select all rows between 'a' and 'h' (inclusive), I only need to know about 'a' and 'h'. To be consistent with other Python slicing, where the stop is excluded, I would also need to know which label follows 'h', which in this case is 'z' but could have been anything.
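As a minimal sketch of that point, using the same frame: the inclusive label slice needs only the two labels, while a half-open slice over the same rows first has to look up where 'h' sits in the index. The variable names here are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4], index=list("achz"))

# Label-based slice: both endpoints are included.
inclusive = df.loc["a":"h"]
print(inclusive.index.tolist())  # ['a', 'c', 'h']

# A half-open slice over the same rows needs the position just past 'h',
# which means looking the label up in the index first.
start = df.index.get_loc("a")
stop = df.index.get_loc("h") + 1
same_rows = df.iloc[start:stop]
print(same_rows.index.tolist())  # ['a', 'c', 'h']
```

With an exclusive stop, every such slice would need this extra lookup, or knowledge of the label that happens to come next.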


There's also a section of the documentation, somewhat hidden away, that explains this design choice: "Endpoints are inclusive".

ALollz answered Nov 14 '22 21:11


In addition to the point in the docs, slice indexing with .loc is not based on cell position. It is in fact value based (the pandas docs call it "label based", but for numerical data I prefer the term "value based"), whereas .iloc uses traditional numpy-style positional (cell) indexing.
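A small sketch of the difference, assuming an integer index whose labels do not coincide with the row positions (the labels 10 to 50 are made up for this illustration):

```python
import pandas as pd

a = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [4, 5, 6], [9, 10, 11], [34, 2, 1]],
    index=[10, 20, 30, 40, 50],  # hypothetical labels, so labels != positions
)

by_label = a.loc[20:40]     # labels 20 through 40, stop label included
by_position = a.iloc[1:3]   # positions 1 and 2, stop position excluded

print(by_label.index.tolist())     # [20, 30, 40]
print(by_position.index.tolist())  # [20, 30]
```

With the default RangeIndex the labels happen to equal the positions, which is why the two accessors look like they should agree in the question's example.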

Furthermore, value based indexing is right-inclusive, whereas cell indexing is not. Just try the following:

a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
a.index = [0, 1, 2, 3.1, 4]  # add a float index

# value based slicing: returns all rows up to and including the stop value
a.loc[1:3.1]
# Out:
#      0   1   2
# 1.0  3   4   5
# 2.0  4   5   6
# 3.1  9  10  11

# position based (cell index) slicing: raises an error, since only integers are allowed
a.iloc[1:3.1]
# Out (exact wording depends on the pandas version):
# TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Float64Index'> with these indexers [3.1] of <class 'float'>

To give an explicit answer to your question of why it is right-inclusive:
When using values/labels as indices, it is, at least in my opinion, intuitive that the last label is included. As far as I know, this is a deliberate design decision about how .loc is meant to work.
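To illustrate why "value based" is a fitting description, here is a sketch using the float index from the example above. It relies on the index being sorted; the bounds 0.5 and 3.5 are arbitrary values picked for the illustration and are not labels in the index.

```python
import pandas as pd

a = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [4, 5, 6], [9, 10, 11], [34, 2, 1]],
    index=[0, 1, 2, 3.1, 4],  # sorted float index
)

# Neither 0.5 nor 3.5 is a label; on a sorted index the slice selects
# every label whose value lies between the two bounds.
between = a.loc[0.5:3.5]
print(between.index.tolist())  # [1.0, 2.0, 3.1]

# When the stop value is itself a label, that row is included.
up_to_label = a.loc[1:3.1]
print(up_to_label.index.tolist())  # [1.0, 2.0, 3.1]
```

Read this way, the slice behaves like a closed interval over the label values, so including the stop is the natural outcome.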

JE_Muc answered Nov 14 '22 21:11