
pandas - Merge nearly duplicate rows based on column value

Tags: python, pandas

I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or "coalesce" these rows into a single row, without summing the numerical values.

Here is an example of what I'm working with:

Name   Sid   Use_Case  Revenue
A      xx01  Voice     $10.00
A      xx01  SMS       $10.00
B      xx02  Voice     $5.00
C      xx03  Voice     $15.00
C      xx03  SMS       $15.00
C      xx03  Video     $15.00

And here is what I would like:

Name   Sid   Use_Case            Revenue
A      xx01  Voice, SMS          $10.00
B      xx02  Voice               $5.00
C      xx03  Voice, SMS, Video   $15.00

I don't want to sum the "Revenue" column because my table is the result of a pivot over several time periods, where "Revenue" simply gets listed multiple times instead of having a different value per "Use_Case".

What would be the best way to tackle this issue? I've looked into the groupby() function but I still don't understand it very well.

Matthew Rosenthal asked Mar 28 '16 21:03


2 Answers

I think you can use groupby with agg, taking the first value of Sid and Revenue in each group and combining the Use_Case values with the custom function ', '.join:

df = df.groupby('Name').agg({'Sid': 'first',
                             'Use_Case': ', '.join,
                             'Revenue': 'first'}).reset_index()

# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])

  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
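For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample table from the question and applies the same groupby/agg call. The DataFrame construction and the variable name merged are my own additions for illustration; Revenue is kept as a string because that is how it appears in the question.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C", "C"],
    "Sid": ["xx01", "xx01", "xx02", "xx03", "xx03", "xx03"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$5.00", "$15.00", "$15.00", "$15.00"],
})

# One row per Name: keep the first Sid and Revenue, join the Use_Case strings.
merged = df.groupby("Name").agg(
    {"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"}
).reset_index()

# Restore the original column order.
merged = merged[["Name", "Sid", "Use_Case", "Revenue"]]
print(merged)
```

Note that ', '.join only works when every Use_Case value is a string; if the column could contain NaN or numbers, they would need to be dropped or converted first.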

A nice idea from the comments (thanks, Goyo) is to group by all the columns that stay constant and join only Use_Case:

df = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case'].apply(', '.join).reset_index()

# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])

  Name   Sid           Use_Case Revenue
0    A  xx01         Voice, SMS  $10.00
1    B  xx02              Voice   $5.00
2    C  xx03  Voice, SMS, Video  $15.00
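The two variants give the same result on the question's data, but they can differ when Sid or Revenue is not actually constant within a Name. The sketch below uses hypothetical data (the $12.00 row is invented for illustration) to show the difference: grouping by Name alone silently keeps the first Revenue, while grouping by all three columns keeps the rows apart.

```python
import pandas as pd

# Hypothetical data: customer A has one row with a different Revenue.
df = pd.DataFrame({
    "Name": ["A", "A", "A"],
    "Sid": ["xx01", "xx01", "xx01"],
    "Use_Case": ["Voice", "SMS", "Video"],
    "Revenue": ["$10.00", "$10.00", "$12.00"],
})

# Variant 1: group by Name only, take the first Sid and Revenue.
by_name = df.groupby("Name").agg(
    {"Sid": "first", "Use_Case": ", ".join, "Revenue": "first"}
).reset_index()

# Variant 2: group by every column that is expected to be constant.
by_all = (
    df.groupby(["Name", "Sid", "Revenue"])["Use_Case"]
      .apply(", ".join)
      .reset_index()
)

print(by_name)  # a single row; the $12.00 value is dropped
print(by_all)   # two rows, one per distinct Revenue
```

Which behaviour is preferable depends on the data: the second variant seems safer if a mismatch should stay visible rather than be discarded.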
jezrael answered Sep 19 '22 01:09


You can group Use_Case by the other columns and apply the list function:

>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
  Name   Sid Revenue                    0
0    A  xx01  $10.00         [Voice, SMS]
1    B  xx02   $5.00              [Voice]
2    C  xx03  $15.00  [Voice, SMS, Video]

(If you are concerned about duplicate values within a group, use set instead of list.)
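One caveat with set is that it does not keep the original order of the values. If order matters, a possible alternative is to deduplicate with dict.fromkeys, which keeps first-seen order on Python 3.7+. The helper name unique_in_order and the sample data with a repeated "Voice" row are hypothetical, added only for this sketch.

```python
import pandas as pd

# Hypothetical data with a repeated Use_Case for the same customer.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B"],
    "Sid": ["xx01", "xx01", "xx01", "xx02"],
    "Use_Case": ["Voice", "SMS", "Voice", "Voice"],
    "Revenue": ["$10.00", "$10.00", "$10.00", "$5.00"],
})


def unique_in_order(values):
    # dict keys keep insertion order, so repeats are dropped
    # while the first occurrence of each value stays in place.
    return list(dict.fromkeys(values))


result = (
    df.groupby(["Name", "Sid", "Revenue"])["Use_Case"]
      .apply(unique_in_order)
      .reset_index()
)
print(result)
```

To get a comma-separated string instead of a list, the helper could return ', '.join(dict.fromkeys(values)).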

Ami Tavory answered Sep 21 '22 01:09