Is there any way I can retain the original index of my large dataframe after I perform a groupby? The reason I need to this is because I need to do an inner merge back to my original df (after my groupby) to regain those lost columns. And the index value is the only 'unique' column to perform the merge back into. Does anyone know how I can achieve this?
My DataFrame is quite large. My groupby looks like this:
df.groupby(['col1', 'col2']).agg({'col3': 'count'}).reset_index()
This drops my original indexes from my original dataframe, which I want to keep.
The Groupby Rolling function does not preserve the original index and so when dates are the same within the Group, it is impossible to know which index value it pertains to from the original dataframe.
Resetting the index after grouping data, using reset_index(), it is a function provided by python to add indexes to the data.
If you want to keep the original index as a column, use reset_index() to reassign the index to a sequential number starting from 0 . You can change the index to a different column by using set_index() after reset_index() . See also the following article for reset_index() .
To reset index after group by, at first group according to a column using groupby(). After that, use reset_index().
You can elevate your index to a column via reset_index
. Then aggregate your index to a tuple via agg
, together with your count
aggregation.
Below is a minimal example.
import pandas as pd, numpy as np
df = pd.DataFrame(np.random.randint(0, 4, (50, 5)),
index=np.random.randint(0, 4, 50))
df = df.reset_index()
res = df.groupby([0, 1]).agg({2: 'count', 'index': lambda x: tuple(x)}).reset_index()
# 0 1 2 index
# 0 0 0 4 (2, 0, 0, 2)
# 1 0 1 4 (0, 3, 1, 1)
# 2 0 2 1 (1,)
# 3 0 3 1 (3,)
# 4 1 0 4 (1, 2, 1, 3)
# 5 1 1 2 (1, 3)
# 6 1 2 4 (2, 1, 2, 2)
# 7 1 3 1 (2,)
# 8 2 0 5 (0, 3, 0, 2, 2)
# 9 2 1 2 (0, 2)
# 10 2 2 5 (1, 1, 3, 3, 2)
# 11 2 3 2 (0, 1)
# 12 3 0 4 (0, 3, 3, 3)
# 13 3 1 4 (1, 3, 0, 1)
# 14 3 2 3 (3, 2, 1)
# 15 3 3 4 (3, 3, 2, 1)
I think you are are looking for transform in this situation:
df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With