Pandas - Merge rows and add columns with 'get_dummies'

Tags: python, pandas, dataframe

With the following dataframe:

import pandas as pd
df=pd.DataFrame(data=[[1,5179530,'rs10799170',8.1548,'E001'], [1,5179530,'rs10799170',8.1548,'E002'], [1,5179530,'rs10799170',8.1548,'E003'], [1,455521,'rs235884',2.584,'E003'], [1,455521,'rs235884',2.584,'E007']], columns=['CHR','BP','SNP','CM','ANNOT'])

   CHR       BP         SNP      CM ANNOT
0    1  5179530  rs10799170  8.1548  E001
1    1  5179530  rs10799170  8.1548  E002
2    1  5179530  rs10799170  8.1548  E003
3    1   455521    rs235884  2.5840  E003
4    1   455521    rs235884  2.5840  E007

I would like to obtain:

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1  5179530  rs10799170  8.1548     1     1     1     0  
1    1   455521    rs235884  2.5840     0     0     1     1

I tried groupby() and get_dummies() separately:

df.groupby(['CHR','BP','SNP','CM']).sum()

                                      ANNOT
CHR BP      SNP        CM
1   455521  rs235884   2.5840      E003E007
    5179530 rs10799170 8.1548  E001E002E003

pd.get_dummies(df['ANNOT'])

    E001  E002  E003  E007
0     1     0     0     0
1     0     1     0     0
2     0     0     1     0
3     0     0     1     0
4     0     0     0     1

But I don't know how to combine the two, or whether there is another way.

asked Jun 23 '17 12:06

Elysire

1 Answer

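As @Dadep points out in their comment, this can be achieved with a pivot table. If you want to stick to your get_dummies + groupby technique, though, you can do something like:

pd.concat([df.drop(columns='ANNOT'), pd.get_dummies(df['ANNOT'], dtype=int)], axis=1).groupby(['CHR','BP','SNP','CM']).sum().reset_index()

This first concatenates your dataframe (without the original ANNOT column) and the output of the get_dummies call, then groups the result by the four identifying columns, sums the indicator columns within each group, and finally resets the index so you don't have to deal with a multi-index dataframe. Two details matter on newer pandas versions: axis=1 is passed by keyword rather than positionally, and ANNOT is dropped before grouping so that sum() does not concatenate its strings into an extra column. Note that groupby sorts by the grouping keys, so the rows come out ordered by BP rather than in their original order. The result looks like: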

   CHR       BP         SNP      CM  E001  E002  E003  E007
0    1   455521    rs235884  2.5840     0     0     1     1
1    1  5179530  rs10799170  8.1548     1     1     1     0
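
The pivot-table route mentioned at the top of this answer is not spelled out in the comment, so the following is only a sketch of one way it might look, using pd.crosstab (which builds a pivot table of counts). The names keys and wide are illustrative choices, not part of the original question or comment.

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        [1, 5179530, 'rs10799170', 8.1548, 'E001'],
        [1, 5179530, 'rs10799170', 8.1548, 'E002'],
        [1, 5179530, 'rs10799170', 8.1548, 'E003'],
        [1, 455521, 'rs235884', 2.584, 'E003'],
        [1, 455521, 'rs235884', 2.584, 'E007'],
    ],
    columns=['CHR', 'BP', 'SNP', 'CM', 'ANNOT'],
)

# Columns that identify one output row (illustrative variable name).
keys = ['CHR', 'BP', 'SNP', 'CM']

# Count how often each ANNOT value occurs for each key combination.
wide = pd.crosstab([df[k] for k in keys], df['ANNOT'])

# Turn the key levels back into ordinary columns and
# drop the leftover "ANNOT" label on the column axis.
wide = wide.reset_index()
wide.columns.name = None

print(wide)
```

As with the groupby version, the rows come out sorted by the key columns. The cells are counts, so if the same (key, ANNOT) pair could appear more than once in your data and you want strict 0/1 indicators, clipping the indicator columns at 1 afterwards would be one option.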

answered Nov 13 '22 05:11

bunji
