
How to keep original index of a DataFrame after groupby 2 columns?

Is there any way to retain the original index of my large DataFrame after I perform a groupby? I need this because, after the groupby, I want to do an inner merge back to my original df to regain the columns that were lost. The index is the only unique key I can merge on. Does anyone know how I can achieve this?

My DataFrame is quite large. My groupby looks like this:

df.groupby(['col1', 'col2']).agg({'col3': 'count'}).reset_index()

This drops the original index of my DataFrame, which I want to keep.
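To make the problem concrete, here is a minimal sketch on made-up data. The column values, the extra `other` column, and the index labels 100 to 103 are assumptions for illustration only; they are not from the real frame.

```python
import pandas as pd

# Hypothetical stand-in for the large DataFrame; index labels are arbitrary.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b"],
        "col2": ["x", "x", "y", "z"],
        "col3": [1, 2, 3, 4],
        "other": [10, 20, 30, 40],
    },
    index=[100, 101, 102, 103],
)

grouped = df.groupby(["col1", "col2"]).agg({"col3": "count"}).reset_index()
print(grouped)
# grouped has one row per (col1, col2) pair and a fresh 0..n-1 index,
# so the labels 100..103 and the 'other' column are no longer available.
```

Because several original rows collapse into one group row, there is no single original index label that a group row could keep.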

Hana asked Mar 11 '18 03:03





2 Answers

You can promote your index to a column with reset_index, then collect each group's index values into a tuple with agg, alongside your count aggregation.

Below is a minimal example.

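import numpy as np
import pandas as pd

# 50 rows of random integers with a random, non-unique index
# (no seed is set, so the values differ on every run)
df = pd.DataFrame(np.random.randint(0, 4, (50, 5)),
                  index=np.random.randint(0, 4, 50))

# Move the original index into a regular column named 'index'
df = df.reset_index()

# Count column 2 per group and gather the group's original index labels into a tuple
res = df.groupby([0, 1]).agg({2: 'count', 'index': lambda x: tuple(x)}).reset_index()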

#     0  1  2            index
# 0   0  0  4     (2, 0, 0, 2)
# 1   0  1  4     (0, 3, 1, 1)
# 2   0  2  1             (1,)
# 3   0  3  1             (3,)
# 4   1  0  4     (1, 2, 1, 3)
# 5   1  1  2           (1, 3)
# 6   1  2  4     (2, 1, 2, 2)
# 7   1  3  1             (2,)
# 8   2  0  5  (0, 3, 0, 2, 2)
# 9   2  1  2           (0, 2)
# 10  2  2  5  (1, 1, 3, 3, 2)
# 11  2  3  2           (0, 1)
# 12  3  0  4     (0, 3, 3, 3)
# 13  3  1  4     (1, 3, 0, 1)
# 14  3  2  3        (3, 2, 1)
# 15  3  3  4     (3, 3, 2, 1)
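The question asks for an inner merge back onto the original frame, so here is one possible way to continue from the tuple column. This is a sketch under assumptions: the data is made up, the index is assumed unique (as the question states), and `DataFrame.explode` requires pandas 0.25 or later. The names `flat`, `long`, and `merged` are illustrative.

```python
import pandas as pd

# Hypothetical data with a unique index, as described in the question.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b"],
        "col2": ["x", "x", "y", "z"],
        "col3": [1, 2, 3, 4],
        "other": [10, 20, 30, 40],
    },
    index=[100, 101, 102, 103],
)

flat = df.reset_index()  # original index becomes a column named 'index'

res = (
    flat.groupby(["col1", "col2"])
    .agg({"col3": "count", "index": lambda x: tuple(x)})
    .reset_index()
    .rename(columns={"col3": "count"})
)

# Unpack the tuples: one row per original index label again.
long = res.explode("index")
long["index"] = long["index"].astype("int64")  # explode leaves object dtype

# Inner merge on the recovered index, then restore it as the index.
merged = flat.merge(long[["index", "count"]], on="index", how="inner")
merged = merged.set_index("index")
print(merged)
```

If the group keys themselves are enough to join on, merging `res` back on `['col1', 'col2']` avoids the tuple column altogether.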
jpp answered Sep 20 '22 15:09



I think you are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
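A small sketch of why this works: `transform` returns a result aligned to the input rows, so the original index and all other columns stay in place and no merge is needed. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical data; the index labels stand in for the original index.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b"],
        "col2": ["x", "x", "y", "z"],
        "col3": [1, 2, 3, 4],
        "other": [10, 20, 30, 40],
    },
    index=[100, 101, 102, 103],
)

# Each row receives the count of its own (col1, col2) group.
df["count"] = df.groupby(["col1", "col2"])["col3"].transform("count")
print(df)
```

Note that `'count'` counts non-null values of `col3` in each group; rows where `col3` is NaN still receive their group's count.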
Scott Boston answered Sep 17 '22 15:09
