I got confused about the following behavior. When I have a dataframe like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'), index=list('bcdefg'))
which looks as follows:
A B C D
b -0.907325 0.211740 0.150066 -0.240011
c -0.307543 0.691359 -0.179995 -0.334836
d 1.280978 0.469956 -0.912541 0.487357
e 1.447153 -0.087224 -0.176256 1.319822
f 0.660994 -0.289151 0.956900 -1.063623
g -1.880520 1.099098 -0.759683 -0.657774
I receive the expected error
TypeError: cannot do slice indexing on with these indexers [3] of type 'int'
when I try the following slice using .loc:
print df.loc[3:, ['C', 'D']]
This is expected, since I pass an integer as an index and not one of the letters contained in the index.
However, if I now try
df.loc[3:, ['C', 'D']] = 10
it works fine and gives me the output:
A B C D
b -0.907325 0.211740 0.150066 -0.240011
c -0.307543 0.691359 -0.179995 -0.334836
d 1.280978 0.469956 -0.912541 0.487357
e 1.447153 -0.087224 10.000000 10.000000
f 0.660994 -0.289151 10.000000 10.000000
g -1.880520 1.099098 10.000000 10.000000
My question is why the same command fails when something is printed and why it works when a value is assigned. When I check the docstring for .loc, I would have expected that this would always result in the error mentioned above (see especially the bold part):
Allowed inputs are:
- A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and **never as an integer position along the index**).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!).
- A boolean array.
- A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)
.loc will raise a KeyError when the items are not found.
Any explanation for that; what am I missing here?
EDIT
In this question similar behavior is considered a bug which was fixed in 0.13. I use 0.19.1.
EDIT 2 Building up on @EdChum's post, one can do the following:
df.loc[2] = 20
df.loc[3] = 30
df.loc[4] = 40
which yields
A B C D
b 0.083326 -1.047032 0.830499 -0.729662
c 0.942744 -0.535013 0.809251 1.132983
d -0.074918 1.123331 -2.205294 -0.497468
e 0.213349 0.694366 -0.816550 0.496324
f 0.021347 0.917340 -0.595254 -0.392177
g -1.149890 0.965645 0.172672 -0.043652
2 20.000000 20.000000 20.000000 20.000000
3 30.000000 30.000000 30.000000 30.000000
4 40.000000 40.000000 40.000000 40.000000
However, that is then still confusing to me because while
print df.loc['d':'f', ['C', 'D']]
works fine, the command
print df.loc[2:4, ['C', 'D']]
raises the TypeError mentioned above.
Additionally, when one now assigns values like this
df.loc[2:4, ['C', 'D']] = 100
the dataframe looks as follows:
A B C D
b 0.083326 -1.047032 0.830499 -0.729662
c 0.942744 -0.535013 0.809251 1.132983
d -0.074918 1.123331 100.000000 100.000000
e 0.213349 0.694366 100.000000 100.000000
f 0.021347 0.917340 -0.595254 -0.392177
g -1.149890 0.965645 0.172672 -0.043652
2 20.000000 20.000000 20.000000 20.000000
3 30.000000 30.000000 30.000000 30.000000
4 40.000000 40.000000 40.000000 40.000000
So the values are not added where one - or at least I - would expect them to be added (the position rather than the label is used).
I don't think this is a bug but rather undocumented semantics. For instance, setting with enlargement is allowed for the simple case where the row label doesn't exist:
In [22]:
df.loc[3] = 10
df
Out[22]:
A B C D
b -0.907325 0.211740 0.150066 -0.240011
c -0.307543 0.691359 -0.179995 -0.334836
d 1.280978 0.469956 -0.912541 0.487357
e 1.447153 -0.087224 -0.176256 1.319822
f 0.660994 -0.289151 0.956900 -1.063623
g -1.880520 1.099098 -0.759683 -0.657774
3 10.000000 10.000000 10.000000 10.000000
If we pass a slice instead, the labels aren't found in the index, but because it is an integer slice it gets converted to an ordinal (positional) slice:
In [24]:
df.loc[3:5] = 9
df
Out[24]:
A B C D
b -0.907325 0.211740 0.150066 -0.240011
c -0.307543 0.691359 -0.179995 -0.334836
d 1.280978 0.469956 -0.912541 0.487357
e 9.000000 9.000000 9.000000 9.000000
f 9.000000 9.000000 9.000000 9.000000
g -1.880520 1.099098 -0.759683 -0.657774
3 10.000000 10.000000 10.000000 10.000000
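To avoid depending on how .loc treats an integer slice, the intent can be spelled out explicitly. The following is only a minimal sketch on a zero-filled frame with the same shape and labels as the question's (the zero fill is an assumption made so the result is easy to check): .iloc is used when positions are meant and .loc when labels are meant.

```python
import numpy as np
import pandas as pd

# Same index and columns as in the question, filled with zeros for clarity.
df = pd.DataFrame(np.zeros((6, 4)), columns=list('ABCD'), index=list('bcdefg'))

# Positional intent: rows at positions 3 and 4 (labels 'e' and 'f'); the stop is excluded.
df.iloc[3:5] = 9

# Label intent: rows 'e' through 'f'; both endpoints are included.
df.loc['e':'f', ['C', 'D']] = 10
```

After this, rows 'e' and 'f' hold 9 in columns A and B and 10 in columns C and D, while all other rows are unchanged.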
The post you linked, and the bug it describes, refer to selection without assignment where a non-existent label is passed, which should raise a KeyError. That is different from the case here.
If we look at __setitem__:
def __setitem__(self, key, value):
    key = com._apply_if_callable(key, self)
    # see if we can slice the rows
    indexer = convert_to_index_sliceable(self, key)
Here it will try to convert the slice by calling convert_to_index_sliceable:
def convert_to_index_sliceable(obj, key):
    """if we are index sliceable, then return my slicer, otherwise return None
    """
    idx = obj.index
    if isinstance(key, slice):
        return idx._convert_slice_indexer(key, kind='getitem')
If we look at the docstring for this:
Signature: df.index._convert_slice_indexer(key, kind=None)
Docstring:
convert a slice indexer. disallow floats in the start/stop/step

Parameters
----------
key : label of the slice bound
kind : {'ix', 'loc', 'getitem', 'iloc'} or None
and then run this:
In [29]:
df.index._convert_slice_indexer(slice(3,5),'loc')
Out[29]:
slice(3, 5, None)
This is then used to slice the index:
In [28]:
df.index[df.index._convert_slice_indexer(slice(3,5),'loc')]
Out[28]:
Index(['e', 'f'], dtype='object')
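The same correspondence can be sketched with public Index methods only, since _convert_slice_indexer is a private helper. This is an illustrative assumption rather than the code path pandas itself takes: a positional slice 3:5 and the label slice 'e':'f' end up selecting the same entries.

```python
import pandas as pd

idx = pd.Index(list('bcdefg'))

# A positional slice on the index picks the entries at positions 3 and 4.
by_position = idx[3:5]

# slice_indexer translates a label slice into the equivalent positional slice.
label_as_positions = idx.slice_indexer('e', 'f')
by_label = idx[label_as_positions]
```

Here label_as_positions is slice(3, 5, None), so by_position and by_label both contain 'e' and 'f'.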
So we see that even though you passed what appeared to be non-existent labels, the integer slice object was converted, according to different rules, into an ordinal slice that was compatible with the df.