After researching, I have found no similar questions on this or any other forum.
I'm grouping a MultiIndex DataFrame by its inner level. The thing is, after grouping I still want to know which values of that inner index were "chosen" for each group.
So I have something of the sort:
import pandas as pd

df = pd.DataFrame([['A', 1, 3],
                   ['A', 2, 4],
                   ['A', 3, 6],
                   ['B', 1, 9],
                   ['B', 2, 10],
                   ['B', 4, 6]],
                  columns=pd.Index(['Name', 'Date', 'Value'], name='ColumnName')
                  ).set_index(['Name', 'Date'])
ColumnName  Value
Name Date
A    1          3
     2          4
     3          6
B    1          9
     2         10
     4          6
What I want is:
ColumnName  Value
Name Date
A    3          6
B    4          6
The closest I could get was with this command:
df.groupby(level='Name').last()
which returns:
ColumnName  Value
Name
A               6
B               6
Or with this command:
df.groupby(level=('Name', 'Date')).last()
which raises an error.
Keep in mind that this is a performance-sensitive application.
Thoughts?
EDIT: In the meantime, I have submitted a feature request on GitHub. The groupby rolling function does not preserve the original index, so when dates are the same within a group, it is impossible to know which index value of the original DataFrame a result pertains to.
Groupby preserves the order of rows within each group. So by using tail(1) instead of last() on the groupby object, you get the desired behaviour:
In [22]: df.groupby(level='Name').tail(1)
Out[22]:
ColumnName  Value
Name Date
A    3          6
B    4          6
This is because tail acts like a 'filter' method: it keeps the original index intact, but only returns certain rows (in this case, the last row of each group). last does not do this: it gives you the last non-NaN value for each column in each group, not necessarily returning original rows.
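To make the distinction concrete, here is a minimal sketch (using a hypothetical frame whose group 'A' ends in a NaN) of how the two methods can disagree:

import numpy as np
import pandas as pd

# Hypothetical frame: the last row of group 'A' has a NaN 'Value'.
nan_df = pd.DataFrame({'Name': ['A', 'A', 'B'],
                       'Date': [1, 2, 1],
                       'Value': [3.0, np.nan, 9.0]}).set_index(['Name', 'Date'])

# tail(1) keeps the actual last row of each group, NaN included:
print(nan_df.groupby(level='Name').tail(1))
#            Value
# Name Date
# A    2       NaN
# B    1       9.0

# last() takes the last non-NaN value per column, so for 'A' it reaches
# back to the Date == 1 row, and the Date level is dropped entirely:
print(nan_df.groupby(level='Name')['Value'].last())
# Name
# A    3.0
# B    9.0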
OLD ANSWER (using last): You can achieve this rather simply with groupby, by turning the index level you want to retain into a column:
In [44]: df.reset_index(level='Date').groupby(level=0).last()
Out[44]:
ColumnName  Date  Value
Name
A              3      6
B              4      6
And afterwards you can set it back as the index to obtain the desired result:
In [46]: df.reset_index(level='Date').groupby(level=0).last().set_index('Date', append=True)
Out[46]:
ColumnName  Value
Name Date
A    3          6
B    4          6
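As an aside (a sketch that is not from the original answers): since groupby preserves row order, you can also keep the full MultiIndex by de-duplicating on the outer level, keeping the final occurrence of each Name:

# True only for the last occurrence of each 'Name' in the index;
# all earlier rows of a group are marked as duplicates.
mask = ~df.index.get_level_values('Name').duplicated(keep='last')
print(df[mask])
# ColumnName  Value
# Name Date
# A    3          6
# B    4          6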
Since performance was asked about, it is true that the groupby solution is slower on the example DataFrame:
In [96]: %timeit get_slice(df)
1000 loops, best of 3: 879 µs per loop
In [97]: %timeit df.reset_index(level='Date').groupby(level='Name').last().set_index('Date', append=True)
100 loops, best of 3: 3.75 ms per loop
In [220]: %timeit df.groupby(level='Name').tail(1)
1000 loops, best of 3: 1.04 ms per loop
But if you look at a bigger example DataFrame, the difference is already much smaller (and the tail method is even faster):
In [82]: import numpy as np; import string

In [83]: df1 = pd.DataFrame(
    ...:     {'Value': np.random.randint(100, size=len(string.ascii_letters)*100)},
    ...:     index=pd.MultiIndex.from_product([list(string.ascii_letters), range(100)],
    ...:                                      names=['Name', 'Date']))
In [84]: df1
Out[84]:
           Value
Name Date
a    0        13
     1         9
     2        11
     3        16
...          ...
Z    96       15
     97       20
     98       40
     99       91

[5200 rows x 1 columns]
In [85]: %timeit get_slice(df1)
100 loops, best of 3: 3.24 ms per loop
In [86]: %timeit df1.reset_index(level='Date').groupby(level='Name').last().set_index('Date', append=True)
100 loops, best of 3: 4.69 ms per loop
In [218]: %timeit df1.groupby(level='Name').tail(1)
1000 loops, best of 3: 1.66 ms per loop
It will depend on the exact application of course, but in many cases such difference in performance won't be significant.
This will get it done:
def get_slice(df):
    # Unique values of each index level, plus the integer codes mapping
    # every row to those values ('labels' in older pandas versions).
    l0, l1 = df.index.levels
    b0, b1 = df.index.codes
    # For each outer-level value, pair it with the inner-level value of the
    # last row in that group, then select those (Name, Date) tuples.
    myslice = [(l0[i], l1[b1[b0 == i][-1]]) for i in range(len(l0))]
    return df.loc[myslice]
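For reference, applied to the question's df this selects exactly the desired rows:

print(get_slice(df))
# ColumnName  Value
# Name Date
# A    3          6
# B    4          6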
Timed
%%timeit
get_slice(df)
1000 loops, best of 3: 458 µs per loop