
Why does .loc behave differently depending on whether values are printed or assigned?

I got confused about the following behavior. When I have a dataframe like this:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'), index=list('bcdefg'))

which looks as follows:

          A         B         C         D
b -0.907325  0.211740  0.150066 -0.240011
c -0.307543  0.691359 -0.179995 -0.334836
d  1.280978  0.469956 -0.912541  0.487357
e  1.447153 -0.087224 -0.176256  1.319822
f  0.660994 -0.289151  0.956900 -1.063623
g -1.880520  1.099098 -0.759683 -0.657774

I receive the expected error

TypeError: cannot do slice indexing on with these indexers [3] of type 'int'

when I try the following slice using .loc:

print(df.loc[3:, ['C', 'D']])

It is expected as I pass an integer as an index and not one of the letters contained in the index.

However, if I now try

df.loc[3:, ['C', 'D']] = 10

it works fine and gives me the output:

          A         B          C          D
b -0.907325  0.211740   0.150066  -0.240011
c -0.307543  0.691359  -0.179995  -0.334836
d  1.280978  0.469956  -0.912541   0.487357
e  1.447153 -0.087224  10.000000  10.000000
f  0.660994 -0.289151  10.000000  10.000000
g -1.880520  1.099098  10.000000  10.000000

My question is why the same command fails when something is printed and why it works when a value is assigned. When I check the doc string for .loc, I would have expected that this would always result in the error mentioned above (see especially the bold part):

Allowed inputs are:

  • A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and **never as an integer position along the index**).
  • A list or array of labels, e.g. ['a', 'b', 'c'].
  • A slice object with labels, e.g. 'a':'f' (note that contrary to usual python slices, both the start and the stop are included!).
  • A boolean array.
  • A callable function with one argument (the calling Series, DataFrame or Panel) and that returns valid output for indexing (one of the above)

.loc will raise a KeyError when the items are not found.
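For reference, the documented rules can be sketched as follows. This is a minimal illustration that assumes a recent pandas release rather than 0.19.1, so the exact exception type may differ between versions; the integer data is made up so that the result is reproducible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  columns=list('ABCD'), index=list('bcdefg'))

# Label slice: both 'd' and 'f' are included -> rows d, e, f
by_label = df.loc['d':'f', ['C', 'D']]

# Positional slice: the stop is excluded -> positions 2, 3, 4 (also d, e, f)
by_position = df.iloc[2:5, 2:4]

# An integer that is not a label of the index is rejected when selecting.
# Depending on the pandas version this is a KeyError or a TypeError.
try:
    df.loc[5]
    lookup_failed = False
except (KeyError, TypeError):
    lookup_failed = True
```

Both selections return the same three rows, which is why the stop values differ ('f' versus 5).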

Any explanation for that; what am I missing here?

EDIT

In this question similar behavior is considered a bug which was fixed in 0.13. I use 0.19.1.

EDIT 2

Building on @EdChum's post, one can do the following:

df.loc[2] = 20
df.loc[3] = 30
df.loc[4] = 40

which yields

           A          B          C          D
b   0.083326  -1.047032   0.830499  -0.729662
c   0.942744  -0.535013   0.809251   1.132983
d  -0.074918   1.123331  -2.205294  -0.497468
e   0.213349   0.694366  -0.816550   0.496324
f   0.021347   0.917340  -0.595254  -0.392177
g  -1.149890   0.965645   0.172672  -0.043652
2  20.000000  20.000000  20.000000  20.000000
3  30.000000  30.000000  30.000000  30.000000
4  40.000000  40.000000  40.000000  40.000000

However, that is then still confusing to me because while

print(df.loc['d':'f', ['C', 'D']])

works fine, the command

print(df.loc[2:4, ['C', 'D']])

raises the TypeError mentioned above.

Additionally, when one now assigns values like this

df.loc[2:4, ['C', 'D']] = 100

the dataframe looks as follows:

           A          B           C           D
b   0.083326  -1.047032    0.830499   -0.729662
c   0.942744  -0.535013    0.809251    1.132983
d  -0.074918   1.123331  100.000000  100.000000
e   0.213349   0.694366  100.000000  100.000000
f   0.021347   0.917340   -0.595254   -0.392177
g  -1.149890   0.965645    0.172672   -0.043652
2  20.000000  20.000000   20.000000   20.000000
3  30.000000  30.000000   30.000000   30.000000
4  40.000000  40.000000   40.000000   40.000000

So the values are not added where one - or at least I - would expect them to be added (the position rather than the label is used).
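One way to sidestep the ambiguity, sketched here under the assumption that the goal is to hit the rows labelled 2, 3 and 4, is to build a boolean mask from the labels instead of passing an integer slice. The zero-filled frame and the name `wanted` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((6, 4)), columns=list('ABCD'), index=list('bcdefg'))
for label in (2, 3, 4):
    df.loc[label] = label * 10.0   # setting with enlargement adds rows labelled 2, 3, 4

# Boolean mask built from the labels themselves: no slice is involved,
# so there is nothing that could be reinterpreted as a position
wanted = df.index.isin([2, 3, 4])
df.loc[wanted, ['C', 'D']] = 100
```

With the mask, only the rows whose labels are 2, 3 and 4 are changed, and rows 'd' and 'e' are left alone.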

Cleb asked Jan 25 '17 18:01



1 Answer

I don't think this is a bug, but rather undocumented semantics. For instance, setting with enlargement is allowed in the simple case where the row label doesn't exist:

In [22]:
df.loc[3] = 10
df

Out[22]:
           A          B          C          D
b  -0.907325   0.211740   0.150066  -0.240011
c  -0.307543   0.691359  -0.179995  -0.334836
d   1.280978   0.469956  -0.912541   0.487357
e   1.447153  -0.087224  -0.176256   1.319822
f   0.660994  -0.289151   0.956900  -1.063623
g  -1.880520   1.099098  -0.759683  -0.657774
3  10.000000  10.000000  10.000000  10.000000

If we pass a slice instead, the labels aren't found in the index, but because it is an integer slice it gets converted to an ordinal (positional) slice:

In [24]:
df.loc[3:5] = 9
df

Out[24]:
           A          B          C          D
b  -0.907325   0.211740   0.150066  -0.240011
c  -0.307543   0.691359  -0.179995  -0.334836
d   1.280978   0.469956  -0.912541   0.487357
e   9.000000   9.000000   9.000000   9.000000
f   9.000000   9.000000   9.000000   9.000000
g  -1.880520   1.099098  -0.759683  -0.657774
3  10.000000  10.000000  10.000000  10.000000
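If positional slicing is what is actually intended, it can be stated explicitly with .iloc, which does not depend on this conversion. A small sketch on a zero-filled frame (the data and the name `col_positions` are illustrative only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((6, 4)), columns=list('ABCD'), index=list('bcdefg'))

# Positions 3 and 4 (stop excluded) -> rows 'e' and 'f', all columns
df.iloc[3:5] = 9

# Mix positional rows with labelled columns by translating the
# column labels to positions first
col_positions = df.columns.get_indexer(['C', 'D'])
df.iloc[3:, col_positions] = 10
```

The second assignment is the positional equivalent of the `df.loc[3:, ['C', 'D']] = 10` from the question.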

The post you linked, and the bug it describes, concerned selection without assignment, where passing a non-existent label should raise a KeyError. That is a different situation from the one here.

If we look at __setitem__:

def __setitem__(self, key, value):
    key = com._apply_if_callable(key, self)

    # see if we can slice the rows
    indexer = convert_to_index_sliceable(self, key)

Here it will try to convert the slice calling convert_to_index_sliceable:

def convert_to_index_sliceable(obj, key):
    """if we are index sliceable, then return my slicer, otherwise return None
    """
    idx = obj.index
    if isinstance(key, slice):
        return idx._convert_slice_indexer(key, kind='getitem')

If we look at the docstrings for this:

Signature: df.index._convert_slice_indexer(key, kind=None)
Docstring:
convert a slice indexer. disallow floats in the start/stop/step

Parameters
----------
key : label of the slice bound
kind : {'ix', 'loc', 'getitem', 'iloc'} or None

and then run this:

In [29]:
df.index._convert_slice_indexer(slice(3,5),'loc')

Out[29]:
slice(3, 5, None)

this is then used to slice the index:

In [28]:
df.index[df.index._convert_slice_indexer(slice(3,5),'loc')]

Out[28]:
Index(['e', 'f'], dtype='object')

So we see that even though you passed what appeared to be non-existent labels, the integer slice object was converted, according to different rules, into an ordinal slice that was compatible with the df.
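The same label-to-position relationship can be inspected without the private method. The sketch below uses the public `Index.slice_indexer`, which is not the code path shown above but illustrates how a label slice and an ordinal slice relate on this index:

```python
import pandas as pd

index = pd.Index(list('bcdefg'))

# Plain positional slicing of the index: positions 3 and 4
positional = index[3:5]

# Public label-to-position translation: both endpoints are included,
# so 'e':'f' becomes the half-open positional slice 3:5
as_positions = index.slice_indexer('e', 'f')
```

Here `index[as_positions]` selects the same two labels as `index[3:5]`.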

EdChum answered Oct 30 '22 11:10