Collapsing rows in a Pandas dataframe

I'm trying to collapse rows in a dataframe that contains a column of ID data and a number of columns that each hold a different string. It looks like groupby is the solution, but it seems to be slanted towards performing some numeric function on the group - I just want to keep the text. Here's what I've got...

I have a dataframe of the form:

index    ID     apples    pears    oranges
0        101                       oranges
1        134    apples
2        576              pears
3        837    apples
4        576                       oranges
5        134              pears

The columns are clean: the apples column will only ever have the text "apples" in it, or it will be blank.

Where there are multiple entries under the same ID (in this example, on IDs 134 & 576), I want to collapse the rows together to get this:

index    ID     apples    pears    oranges
0        101                       oranges
1        134    apples    pears
2        576              pears    oranges
3        837    apples

I could do this by iterating over the rows, but it seems like a non-pandas solution. Is there a better way?
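For reference, here's a minimal way to build the sample frame above (I'm not sure whether the blanks end up as empty strings or NaN in practice; this sketch assumes empty strings):

import pandas as pd

# sample data from the table above; blanks as empty strings
# (they could just as well be NaN in the real data)
df = pd.DataFrame({
    'ID':      [101, 134, 576, 837, 576, 134],
    'apples':  ['', 'apples', '', 'apples', '', ''],
    'pears':   ['', '', 'pears', '', '', 'pears'],
    'oranges': ['oranges', '', '', '', 'oranges', ''],
})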

asked Apr 20 '17 by user4896331



1 Answer

You can use groupby with ''.join as the aggregation function, or with sum or max:

# if the blank values are NaN, first replace them with ''
df = df.fillna('')

# join the text in each column per ID group
df = df.groupby('ID').agg(''.join)
print(df)
     apples  pears  oranges
ID                         
101                 oranges
134  apples  pears         
576          pears  oranges
837  apples   
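If you need ID back as a regular column instead of the index (not shown in the output above, just a common follow-up), reset the index afterwards:

df = df.fillna('')
df = df.groupby('ID').agg(''.join).reset_index()
# or group with as_index=False to keep ID as a column from the start
print(df)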

These also work:

df = df.fillna('')
df = df.groupby('ID').sum()
# alternatively, use max:
# df = df.groupby('ID').max()
print(df)
     apples  pears  oranges
ID                         
101                 oranges
134  apples  pears         
576          pears  oranges
837  apples     
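Both work because the blanks are plain empty strings after fillna: sum concatenates strings with +, and max returns the single non-empty value, since '' sorts before any non-empty string. A quick plain-Python illustration of the principle:

print('apples' + '' + '')       # 'apples'
print(max(['', '', 'pears']))   # 'pears'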

Also, if you need to remove duplicates per group and per column, add unique:

df = df.groupby('ID').agg(lambda x: ''.join(x.unique()))
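For example, with hypothetical data where the same fruit appears twice under one ID (not from the question's data), the unique variant keeps only one copy instead of concatenating it twice:

import pandas as pd

# hypothetical duplicate rows for ID 134
df = pd.DataFrame({'ID': [134, 134, 576],
                   'apples': ['apples', 'apples', ''],
                   'pears': ['', '', 'pears']})

print(df.groupby('ID').agg(''.join))                        # 134 -> 'applesapples'
print(df.groupby('ID').agg(lambda x: ''.join(x.unique())))  # 134 -> 'apples'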
answered Sep 20 '22 by jezrael