I have a data frame with index (year, foo), where I would like to select the X largest observations of foo where year == someYear.
My approach was
df.sort_index(level=[0, 1], ascending=[1, 0], inplace=True)
df.loc[pd.IndexSlice[2002, :10], :]
but I get
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (0)'
I tried different variants of sorting (e.g. ascending = [0, 0]), but they all resulted in some sort of error.
If I only wanted the xth row, I could use df.groupby(level=[0]).nth(x) after sorting, but since I want a set of rows, that doesn't feel very efficient.
What's the best way to select these rows? Some data to play with:
               rank_int  rank
year foo
2015 1.381845         2   320
     1.234795         2   259
     1.148488       199     2
     0.866704         2   363
     0.738022         2   319
You can slice a MultiIndex by providing multiple indexers. You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers. You can use slice(None) to select all the contents of that level.
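As a rough illustration of the slicing described above, here is a minimal sketch. The frame and its values are made up for the example; only the (year, foo) index layout comes from the question.

```python
import pandas as pd

# Hypothetical sample frame with a (year, foo) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [(2014, 0.5), (2014, 1.5), (2015, 0.7), (2015, 1.2), (2015, 1.4)],
    names=["year", "foo"],
)
# Label slicing needs a lexsorted index, so sort ascending on all levels.
df = pd.DataFrame({"rank": [10, 20, 30, 40, 50]}, index=idx).sort_index()

# All rows for 2015: a label on level 0, slice(None) on level 1.
rows_2015 = df.loc[(2015, slice(None)), :]

# Label-based slice on foo: rows in 2015 whose foo label is 1.2 or larger.
high_foo = df.loc[pd.IndexSlice[2015, 1.2:], :]
```

Note that both selections work on labels, so the slice 1.2: refers to foo values, not to row positions.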
A multi-index DataFrame has multi-level, or hierarchical, indexing. The reset_index() method converts the index levels into columns: DataFrame.reset_index() resets the index to the default integer index and turns the old index levels into columns of the DataFrame.
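A small sketch of what reset_index() does to a two-level index, using made-up values. It returns a new frame rather than modifying the original, so the result has to be assigned.

```python
import pandas as pd

# Hypothetical two-row frame with a (year, foo) MultiIndex.
idx = pd.MultiIndex.from_tuples(
    [(2014, 0.5), (2015, 1.2)], names=["year", "foo"]
)
df = pd.DataFrame({"rank": [10, 20]}, index=idx)

# reset_index returns a new frame; df itself keeps its MultiIndex.
flat = df.reset_index()

print(list(flat.columns))
```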
ascending should be a boolean or a list of booleans, not a list of integers. Try sorting this way:
df.sort_index(ascending=True, inplace=True)
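One way to see why the sort order matters for label slicing is to check whether the index is monotonic after sorting. This is only a sketch on made-up data, not a full explanation of the original error:

```python
import pandas as pd

# Hypothetical frame with a (year, foo) MultiIndex, initially unsorted.
idx = pd.MultiIndex.from_tuples(
    [(2015, 1.2), (2014, 0.5), (2015, 0.7), (2014, 1.5), (2015, 1.4)],
    names=["year", "foo"],
)
df = pd.DataFrame({"rank": [40, 10, 30, 20, 50]}, index=idx)

# Ascending on every level: the index ends up fully sorted.
fully_sorted = df.sort_index(ascending=True)

# Ascending year, descending foo: the index is ordered, but not lexsorted.
mixed = df.sort_index(level=["year", "foo"], ascending=[True, False])

print(fully_sorted.index.is_monotonic_increasing)
print(mixed.index.is_monotonic_increasing)
```

Label slices through df.loc on a MultiIndex expect the fully sorted form, which is why a mixed ascending/descending sort can still lead to the lexsort error.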
First, you should sort like this:
df.sort_index(level=['year','foo'], ascending=[1, 0], inplace=True)
It should fix the KeyError, but df.loc[pd.IndexSlice[2002, :10], :] won't give you the result you are expecting. loc is label-based, unlike iloc, so :10 is treated as a slice over foo labels up to 10, not as the first 10 rows. There is no positional (iloc-style) slicing within the inner levels of a MultiIndex, so I would suggest using groupby. If you already have this MultiIndex, you should do:
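df = df.reset_index()  # reset_index returns a new frame, so assign the result
df = df.sort_values(by=['year', 'foo'], ascending=[True, False])
df.groupby('year').head(10)  # first 10 rows per year, i.e. the 10 largest foo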
If you need the n entries with the smallest foo, you can use tail(n). If you need, say, the first, third and fifth entries, you can use nth([0,2,4]), as you mentioned in the question.
I think it's the most efficient way one could do it.
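Putting the steps together, a self-contained sketch might look like this. The 2015 rows are the sample data from the question; the 2014 rows and the variable names (n, flat, top_n, bottom_n, top_2015) are made up for illustration.

```python
import pandas as pd

# 2015 rows are copied from the question; the 2014 rows are made up
# so that there is more than one group.
df = pd.DataFrame(
    {
        "year": [2015, 2015, 2015, 2015, 2015, 2014, 2014, 2014],
        "foo": [1.381845, 1.234795, 1.148488, 0.866704, 0.738022, 0.9, 1.7, 0.2],
        "rank_int": [2, 2, 199, 2, 2, 1, 1, 1],
        "rank": [320, 259, 2, 363, 319, 5, 6, 7],
    }
).set_index(["year", "foo"])

n = 2  # number of rows wanted per year

# Move the index levels into columns, then sort: year ascending, foo descending.
flat = df.reset_index().sort_values(by=["year", "foo"], ascending=[True, False])

top_n = flat.groupby("year").head(n)     # n largest foo per year
bottom_n = flat.groupby("year").tail(n)  # n smallest foo per year

# Restrict to a single year if only that one is needed.
top_2015 = top_n[top_n["year"] == 2015]
print(top_2015)
```

If the (year, foo) index is needed afterwards, calling set_index(["year", "foo"]) on the result restores it.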