I have a pandas DataFrame with several rows that are near-duplicates of each other, except for one value. My goal is to merge, or "coalesce", these rows into a single row, without summing the numerical values.
Here is an example of what I'm working with:
Name  Sid   Use_Case  Revenue
A     xx01  Voice     $10.00
A     xx01  SMS       $10.00
B     xx02  Voice     $5.00
C     xx03  Voice     $15.00
C     xx03  SMS       $15.00
C     xx03  Video     $15.00
And here is what I would like:
Name  Sid   Use_Case           Revenue
A     xx01  Voice, SMS         $10.00
B     xx02  Voice              $5.00
C     xx03  Voice, SMS, Video  $15.00
The reason I don't want to sum the "Revenue" column is because my table is the result of doing a pivot over several time periods where "Revenue" simply ends up getting listed multiple times instead of having a different value per "Use_Case".
What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.
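For anyone who wants to follow along, a reproducible version of the sample table (values copied from the question) can be built like this:

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
    'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
    'Revenue': ['$10.00', '$10.00', '$5.00', '$15.00', '$15.00', '$15.00'],
})
print(df)
```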
I think you can use groupby with aggregate, taking 'first' for the columns whose values repeat within a group and the custom function ', '.join for Use_Case:
df = df.groupby('Name').agg({'Sid': 'first',
                             'Use_Case': ', '.join,
                             'Revenue': 'first'}).reset_index()

# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])

  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
A nice idea from the comments, thanks Goyo -- group by all the columns that should stay constant and join only Use_Case:
df = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case'].apply(', '.join).reset_index()

# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])

  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
You can groupby and apply the list function:
>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
  Name   Sid Revenue                    0
0    A  xx01  $10.00         [Voice, SMS]
1    B  xx02   $5.00              [Voice]
2    C  xx03  $15.00  [Voice, SMS, Video]
(In case you are concerned about duplicates, use set instead of list.)
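A minimal runnable sketch of that deduplicating variant (the sample values here are made up to include a repeated use case; sorted() is added to make the output deterministic, since sets are unordered):

```python
import pandas as pd

# Made-up data with a duplicate 'Voice' entry for group A
df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'B'],
    'Sid': ['xx01', 'xx01', 'xx01', 'xx02'],
    'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice'],
    'Revenue': ['$10.00', '$10.00', '$10.00', '$5.00'],
})

# set() drops the duplicate 'Voice'; sorted() fixes the ordering
out = (df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case']
         .apply(lambda s: ', '.join(sorted(set(s))))
         .reset_index())
print(out)
```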