
Pandas - Merge rows and add columns with 'get_dummies'

With the following dataframe:

import pandas as pd
df = pd.DataFrame(
    data=[[1, 5179530, 'rs10799170', 8.1548, 'E001'],
          [1, 5179530, 'rs10799170', 8.1548, 'E002'],
          [1, 5179530, 'rs10799170', 8.1548, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E007']],
    columns=['CHR', 'BP', 'SNP', 'CM', 'ANNOT'])

   CHR       BP         SNP      CM ANNOT
0    1  5179530  rs10799170  8.1548  E001
1    1  5179530  rs10799170  8.1548  E002
2    1  5179530  rs10799170  8.1548  E003
3    1   455521    rs235884  2.5840  E003
4    1   455521    rs235884  2.5840  E007

I would like to obtain:

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1  5179530  rs10799170  8.1548     1     1     1     0  
1    1   455521    rs235884  2.5840     0     0     1     1

I tried groupby() and get_dummies() separately:

df.groupby(['CHR','BP','SNP','CM']).sum()

    CHR BP      SNP        CM         ANNOT           
1   455521  rs235884   2.5840      E003E007
    5179530 rs10799170 8.1548  E001E002E003

pd.get_dummies(df['ANNOT'])

    E001  E002  E003  E007
0     1     0     0     0
1     0     1     0     0
2     0     0     1     0
3     0     0     1     0
4     0     0     0     1

But I don't know how to combine both or if there is another way.

Elysire asked Jun 23 '17 12:06


1 Answer

As @Dadep points out in their comment, this can be achieved with a pivot table. If you want to stick to your get_dummies + groupby technique though you can do something like:

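pd.concat([df.drop(columns='ANNOT'), pd.get_dummies(df['ANNOT'])], axis=1).groupby(['CHR','BP','SNP','CM']).sum().reset_index()

This first drops the original ANNOT column and concatenates the remaining columns with the output of the get_dummies call. The axis is passed by keyword because recent pandas versions no longer accept it positionally, and ANNOT is dropped so that sum() only has the dummy columns to aggregate. It then groups the result by the relevant columns, sums the dummy columns within each group, and resets the index so you don't have to deal with a multi-index data frame. The result looks like: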

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1   455521    rs235884  2.5840     0     0     1     1
1    1  5179530  rs10799170  8.1548     1     1     1     0
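
For comparison, the pivot-table route mentioned at the top of this answer could look roughly like the sketch below. The helper column `n` and the names `keys` and `wide` are made up for illustration and are not part of the original question or comment; this is one possible way to write it, not necessarily what @Dadep had in mind.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(
    data=[[1, 5179530, 'rs10799170', 8.1548, 'E001'],
          [1, 5179530, 'rs10799170', 8.1548, 'E002'],
          [1, 5179530, 'rs10799170', 8.1548, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E003'],
          [1, 455521, 'rs235884', 2.584, 'E007']],
    columns=['CHR', 'BP', 'SNP', 'CM', 'ANNOT'])

keys = ['CHR', 'BP', 'SNP', 'CM']

# 'n' is a hypothetical helper column of ones; summing it per
# (keys, ANNOT) combination counts the occurrences, and fill_value=0
# covers combinations that never occur.
wide = (
    df.assign(n=1)
      .pivot_table(index=keys, columns='ANNOT', values='n',
                   aggfunc='sum', fill_value=0)
      .reset_index()
)
wide.columns.name = None  # drop the leftover 'ANNOT' label on the column axis

print(wide)
```

Like the groupby version, this produces counts rather than strict 0/1 flags, so a repeated (SNP, ANNOT) row would give a value above 1.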

bunji answered Nov 13 '22 05:11