How to spread a column in a Pandas data frame

Tags: python, pandas, dataframe, pivot

I have the following pandas data frame:

import pandas as pd
import numpy as np
df = pd.DataFrame({
               'fc': [100,100,112,1.3,14,125],
               'sample_id': ['S1','S1','S1','S2','S2','S2'],
               'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c'],
               })

df = df[['gene_symbol', 'sample_id', 'fc']]
df

Which produces this:

Out[11]:
  gene_symbol sample_id     fc
0           a        S1  100.0
1           b        S1  100.0
2           c        S1  112.0
3           a        S2    1.3
4           b        S2   14.0
5           c        S2  125.0

How can I spread sample_id so that in the end I get this:

gene_symbol    S1   S2
a             100   1.3
b             100   14.0
c             112   125.0

asked May 15 '17 07:05

neversaint

1 Answer

Use pivot or unstack:

#df = df[['gene_symbol', 'sample_id', 'fc']]
# assign to a new name so the long-format df stays available for the unstack variant
df1 = df.pivot(index='gene_symbol',columns='sample_id',values='fc')
print (df1)
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  125.0

df = df.set_index(['gene_symbol','sample_id'])['fc'].unstack(fill_value=0)
print (df)
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  125.0
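
The two variants differ when a gene/sample combination is missing from the long data. The following is a minimal sketch, assuming a made-up variant of the question's frame in which the row for gene a in sample S2 has been dropped; the names wide_pivot and wide_unstack are only for illustration.

```python
import pandas as pd

# Hypothetical variant of the question's data: the ('a', 'S2') row is missing.
df = pd.DataFrame({
    'gene_symbol': ['a', 'b', 'c', 'b', 'c'],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2'],
    'fc': [100.0, 100.0, 112.0, 14.0, 125.0],
})

# pivot leaves the missing combination as NaN
wide_pivot = df.pivot(index='gene_symbol', columns='sample_id', values='fc')

# unstack(fill_value=0) puts 0 there instead
wide_unstack = df.set_index(['gene_symbol', 'sample_id'])['fc'].unstack(fill_value=0)

print(wide_pivot)
print(wide_unstack)
```

Whether 0 is a sensible placeholder depends on the data; for fold changes it may be safer to keep NaN and call unstack() without fill_value.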

But if there are duplicate gene_symbol/sample_id pairs, you need pivot_table, or you can aggregate with groupby and then unstack. The aggregation mean can be changed to sum, median, ...:

df = pd.DataFrame({
               'fc': [100,100,112,1.3,14,125, 100],
               'sample_id': ['S1','S1','S1','S2','S2','S2', 'S2'],
               'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
               })
print (df)
      fc gene_symbol sample_id
0  100.0           a        S1
1  100.0           b        S1
2  112.0           c        S1
3    1.3           a        S2
4   14.0           b        S2
5  125.0           c        S2 <- same c, S2, different fc
6  100.0           c        S2 <- same c, S2, different fc

df = df.pivot(index='gene_symbol',columns='sample_id',values='fc')

ValueError: Index contains duplicate entries, cannot reshape
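
Before choosing an aggregation it can help to see which rows actually collide. A small sketch using DataFrame.duplicated on the two key columns; the names dups and pivot_error are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# keep=False marks every row of a duplicated (gene_symbol, sample_id) pair
dups = df[df.duplicated(subset=['gene_symbol', 'sample_id'], keep=False)]
print(dups)

# pivot refuses to reshape while such pairs exist
pivot_error = None
try:
    df.pivot(index='gene_symbol', columns='sample_id', values='fc')
except ValueError as err:
    pivot_error = str(err)
print(pivot_error)
```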

# assign to a new name so the long-format df stays available for the groupby variant
df1 = df.pivot_table(index='gene_symbol',columns='sample_id',values='fc', aggfunc='mean')
print (df1)
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  112.5

df = df.groupby(['gene_symbol','sample_id'])['fc'].mean().unstack(fill_value=0)
print (df)
sample_id       S1     S2
gene_symbol              
a            100.0    1.3
b            100.0   14.0
c            112.0  112.5
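
As an illustration of swapping the aggregation, here is a short sketch on the same duplicated data using sum and max instead of mean. The variable names by_sum, by_max and by_groupby are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

# Only the (c, S2) cell has two values, so only that cell changes with aggfunc
by_sum = df.pivot_table(index='gene_symbol', columns='sample_id',
                        values='fc', aggfunc='sum')
by_max = df.pivot_table(index='gene_symbol', columns='sample_id',
                        values='fc', aggfunc='max')

# The groupby route with the same aggregation
by_groupby = df.groupby(['gene_symbol', 'sample_id'])['fc'].sum().unstack()

print(by_sum)
print(by_max)
print(by_groupby)
```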

EDIT:

To clean up the result, set the name of the columns index to None and call reset_index:

df.columns.name = None
df = df.reset_index()
print (df)
  gene_symbol     S1     S2
0           a  100.0    1.3
1           b  100.0   14.0
2           c  112.0  112.5
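
The reshape and the cleanup can also be written as one chain. A minimal sketch, assuming rename_axis(None, axis=1) is used in place of assigning to columns.name; the name tidy is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'fc': [100, 100, 112, 1.3, 14, 125, 100],
    'sample_id': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2'],
    'gene_symbol': ['a', 'b', 'c', 'a', 'b', 'c', 'c'],
})

tidy = (
    df.pivot_table(index='gene_symbol', columns='sample_id',
                   values='fc', aggfunc='mean')
      .rename_axis(None, axis=1)   # drop the 'sample_id' label above the columns
      .reset_index()               # turn gene_symbol back into a regular column
)
print(tidy)
```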

answered Oct 03 '22 19:10

jezrael
