I am slicing a pandas DataFrame and seem to be getting unexpected slices with .loc, at least compared to NumPy and ordinary Python slicing. See the example below.
>>> import pandas as pd
>>> a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
>>> a
    0   1   2
0   0   1   2
1   3   4   5
2   4   5   6
3   9  10  11
4  34   2   1
>>> a.loc[1:3, :]
   0   1   2
1  3   4   5
2  4   5   6
3  9  10  11
>>> a.values[1:3, :]
array([[3, 4, 5],
       [4, 5, 6]])
Interestingly, this only happens with .loc, not .iloc.
>>> a.iloc[1:3, :]
   0  1  2
1  3  4  5
2  4  5  6
Thus, .loc appears to be inclusive of the terminating index, but NumPy and .iloc are not.
By the comments, it seems this is not a bug and we are well warned. But why is it the case?
Remember that .loc is primarily label-based indexing. The decision to include the stop endpoint becomes far more obvious when working with a non-RangeIndex:
df = pd.DataFrame([1,2,3,4], index=list('achz'))
#    0
# a  1
# c  2
# h  3
# z  4
If I want to select all rows between 'a' and 'h' (inclusive), I only know about 'a' and 'h'. To be consistent with other Python slicing, you would also need to know which label follows 'h', which in this case is 'z' but could have been anything.
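To make that concrete, here is a minimal sketch using the same df. The names inclusive, start, stop and positional are only illustrative; the point is that the half-open equivalent has to look up positions first, while the label slice needs nothing beyond 'a' and 'h'.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4], index=list("achz"))

# Label-based slice: both endpoints are included, so knowing 'a' and 'h' is enough.
inclusive = df.loc["a":"h"]

# Positional equivalent: translate the labels to positions and add 1 to the stop,
# because .iloc follows the usual half-open Python convention.
start = df.index.get_loc("a")
stop = df.index.get_loc("h") + 1
positional = df.iloc[start:stop]

print(inclusive.index.tolist())      # ['a', 'c', 'h']
print(positional.equals(inclusive))  # True
```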
There's also a section of the documentation, somewhat hidden away, that explains this design choice: "Endpoints are inclusive".
In addition to the point made in the docs, pandas slice indexing with .loc is not based on cell position. It is in fact value-based indexing (the pandas docs call it "label based", but for numerical data I prefer the term "value based"), whereas .iloc is traditional NumPy-style positional indexing.
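A small sketch of that distinction, using a made-up frame (the column name x and the labels 10 to 50 are assumptions chosen so that labels and positions cannot be confused):

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the row positions.
df = pd.DataFrame({"x": [0, 1, 2, 3, 4]}, index=[10, 20, 30, 40, 50])

by_label = df.loc[20:40]      # labels 20 through 40, stop label included
by_position = df.iloc[1:3]    # positions 1 and 2, stop position excluded
no_such_labels = df.loc[1:3]  # no label lies between 1 and 3, so nothing matches

print(by_label.index.tolist())     # [20, 30, 40]
print(by_position.index.tolist())  # [20, 30]
print(no_such_labels.empty)        # True
```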
Furthermore, value based indexing is right-inclusive, whereas cell indexing is not. Just try the following:
a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
a.index = [0, 1, 2, 3.1, 4] # add a float index
# value based slicing: the following outputs all rows up to and including the slice value
a.loc[1:3.1]
# Out:
#      0   1   2
# 1.0  3   4   5
# 2.0  4   5   6
# 3.1  9  10  11
# position based slicing: raises an error, since only integers are allowed
a.iloc[1:3.1]
# Out: TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Float64Index'> with these indexers [3.1] of <class 'float'>
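Another way to see the value-based behaviour, as a rough sketch: on a sorted index the slice bounds do not even have to be existing labels. The bounds 0.5 and 3.5 below are arbitrary values picked for illustration, and this relies on the index being monotonic.

```python
import pandas as pd

a = pd.DataFrame([[0, 1, 2], [3, 4, 5], [4, 5, 6], [9, 10, 11], [34, 2, 1]])
a.index = [0, 1, 2, 3.1, 4]  # same float index as above

# Neither 0.5 nor 3.5 is a label, yet .loc returns every row whose label
# falls between the two values (the index is sorted, so this is well defined).
between = a.loc[0.5:3.5]

print(between.index.tolist())  # [1.0, 2.0, 3.1]
```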
To give an explicit answer to your question of why it is right-inclusive:
When values/labels are used as indices it is, at least in my opinion, intuitive that the last label is included. As far as I know, this is a design decision about how the function is meant to work.