I've read the docs about slicers a million times, but have never got my head round it, so I'm still trying to figure out how to use loc
to slice a DataFrame
with a MultiIndex
.
I'll start with the DataFrame
from this SO answer:
value first second third fourth A0 B0 C1 D0 2 D1 3 C2 D0 6 D1 7 B1 C1 D0 10 D1 11 C2 D0 14 D1 15 A1 B0 C1 D0 18 D1 19 C2 D0 22 D1 23 B1 C1 D0 26 D1 27 C2 D0 30 D1 31 A2 B0 C1 D0 34 D1 35 C2 D0 38 D1 39 B1 C1 D0 42 D1 43 C2 D0 46 D1 47 A3 B0 C1 D0 50 D1 51 C2 D0 54 D1 55 B1 C1 D0 58 D1 59 C2 D0 62 D1 63
To select just A0
and C1
values, I can do:
In [26]: df.loc['A0', :, 'C1', :] Out[26]: value first second third fourth A0 B0 C1 D0 2 D1 3 B1 C1 D0 10 D1 11
Which also works selecting from three levels, and even with tuples:
In [28]: df.loc['A0', :, ('C1', 'C2'), 'D1'] Out[28]: value first second third fourth A0 B0 C1 D1 3 C2 D1 5 B1 C1 D1 11 C2 D1 13
So far, intuitive and brilliant.
So why can't I select all values from the first index level?
In [30]: df.loc[:, :, 'C1', :] --------------------------------------------------------------------------- IndexingError Traceback (most recent call last) <ipython-input-30-57b56108d941> in <module>() ----> 1 df.loc[:, :, 'C1', :] /usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in __getitem__(self, key) 1176 def __getitem__(self, key): 1177 if type(key) is tuple: -> 1178 return self._getitem_tuple(key) 1179 else: 1180 return self._getitem_axis(key, axis=0) /usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _getitem_tuple(self, tup) 694 695 # no multi-index, so validate all of the indexers --> 696 self._has_valid_tuple(tup) 697 698 # ugly hack for GH #836 /usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _has_valid_tuple(self, key) 125 for i, k in enumerate(key): 126 if i >= self.obj.ndim: --> 127 raise IndexingError('Too many indexers') 128 if not self._has_valid_type(k, i): 129 raise ValueError("Location based indexing can only have [%s] " IndexingError: Too many indexers
Surely this is not intended behaviour?
Note: I know this is possible with df.xs('C1', level='third')
but the current .loc
behaviour seems inconsistent.
loc . I have a DataFrame with 4.8 million rows, and selecting a single row using . iloc[[ id ]] (with a single-element list) takes 489 ms, almost half a second, 1,800x times slower than the identical .
The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
loc and iloc are interchangeable when labels are 0-based integers.
When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.
The reason this doesn't work is tied to the need to specify the axis of indexing (mentioned in http://pandas.pydata.org/pandas-docs/stable/advanced.html). An alternative solution to your problem is to simply do this:
df.loc(axis=0)[:, :, 'C1', :]
Pandas gets confused sometimes when indexes are similar or contain similar values. If you were to have a column named 'C1' or something you would also need to do this under this style of slicing/selecting.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With