Quantile normalization on a pandas DataFrame

Simply put, how can I apply quantile normalization to a large pandas DataFrame (roughly 2,000,000 rows) in Python?

PS. I know that the rpy2 package can run R from Python, so R's quantile normalization could be used. However, R does not return the correct result when I use the data set below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how can I apply quantile normalization following the steps described at https://en.wikipedia.org/wiki/Quantile_normalization?

I found a piece of Python code that claims to compute the quantile normalization:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

# Load the R package that provides normalize.quantiles.
preprocessCore = importr('preprocessCore')

# Each inner list becomes one column of the R matrix.
matrix = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)

# rpy2 exposes R's normalize.quantiles as normalize_quantiles.
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array(Rnormalized_matrix)
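One thing worth checking before blaming the R routine is which axis the snippet above treats as a sample. The following is a minimal sketch, not part of the original code, that reproduces the same flattening and column-wise fill in NumPy without R. The name `m_np` is hypothetical, and `order='F'` is used here only to mimic R's `byrow=False` behaviour.

```python
import numpy as np

matrix = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]

# Flatten inner list by inner list, the same way the FloatVector is built.
flat = [element for col in matrix for element in col]

# R fills a matrix column by column when byrow=False; order='F' mimics that.
m_np = np.array(flat, dtype=float).reshape((-1, len(matrix)), order='F')

print(m_np.shape)   # (rows, columns) of the matrix handed to R
print(m_np)
```

Under this layout each inner list ends up as one column, so data stored with one sample per row (as the CSV in the question may be) would need to be transposed first. Whether orientation explains the unexpected result is not established by the question.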

The code works fine with the sample data in the snippet, but when I tested it with the data given above, the result was wrong.

Since rpy2 only provides an interface for running R from Python, I tested it again in R directly, and the result was still wrong. I therefore suspect that the method in R itself is at fault.

asked Jun 21 '16 05:06

Shawn. L

2 Answers

Using the example dataset from the Wikipedia article:

import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df
Out: 
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value can be calculated with the following:

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean
Out: 
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64
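As a cross-check on what these rank means represent, the same numbers can be obtained by sorting every column independently and averaging across columns row by row. This is a small sketch under the assumption that the frame has no missing values; `rank_mean_np` is a hypothetical name, not something from the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

# Sort each column on its own, so row i holds the i-th smallest value of every column.
sorted_cols = np.sort(df.to_numpy(dtype=float), axis=0)

# The mean of row i is the mean of all values that have rank i + 1.
rank_mean_np = pd.Series(sorted_cols.mean(axis=1),
                         index=np.arange(1, len(df) + 1))
print(rank_mean_np)
```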

Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out: 
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
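Note that with method='min' the two tied values in C2 both receive the rank-3 mean (4.666667), whereas the Wikipedia article assigns tied values the mean of the rank means they span (about 5.17 here). If that convention is wanted, one possible sketch is below. The function name `quantile_normalize_avg_ties` is hypothetical, and the code assumes a numeric frame without missing values.

```python
import numpy as np
import pandas as pd

def quantile_normalize_avg_ties(df):
    """Quantile-normalize columns; tied values share the mean of their rank means."""
    # Mean of each rank across columns (row-wise mean of the column-sorted values).
    rank_mean = np.sort(df.to_numpy(dtype=float), axis=0).mean(axis=1)
    out = {}
    for col in df.columns:
        # method='first' gives every value a distinct rank 1..n, even among ties.
        first_rank = df[col].rank(method='first').astype(int)
        mapped = pd.Series(rank_mean[first_rank.to_numpy() - 1], index=df.index)
        # Average the mapped values within each group of equal original values.
        out[col] = mapped.groupby(df[col]).transform('mean')
    return pd.DataFrame(out, index=df.index)

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
result_avg = quantile_normalize_avg_ties(df)
print(result_avg)
```

The per-column loop keeps the logic readable; for a frame with many columns a vectorized version may be preferable.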

answered Sep 30 '22 08:09

ayhan

OK, I implemented the method myself, and it is relatively efficient.

In hindsight the logic seems fairly easy, but I decided to post it here anyway for anyone who is as confused as I was when I couldn't find existing code through Google.

The code is on GitHub: Quantile Normalize
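The linked code is not reproduced here. As a rough illustration of how a NumPy-based version might look for a large frame, the sketch below sorts each column once and scatters the rank means back to the original positions. It is not the author's implementation; the function name `quantile_normalize_np` is hypothetical, the frame is assumed to be numeric with no missing values, and ties are broken by position (like rank(method='first')) rather than averaged.

```python
import numpy as np
import pandas as pd

def quantile_normalize_np(df):
    """Quantile-normalize the columns of df (ties broken by position)."""
    values = df.to_numpy(dtype=float)
    # order[i, j] is the row index of the i-th smallest value in column j.
    order = np.argsort(values, axis=0, kind='stable')
    sorted_vals = np.take_along_axis(values, order, axis=0)
    # Mean across columns of the i-th smallest values.
    rank_mean = sorted_vals.mean(axis=1)
    # Write the mean for rank i back to where the i-th smallest value came from.
    normalized = np.empty_like(values)
    np.put_along_axis(normalized, order, rank_mean[:, None], axis=0)
    return pd.DataFrame(normalized, index=df.index, columns=df.columns)

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
result_np = quantile_normalize_np(df)
print(result_np)
```

With this tie-breaking every column of the output contains exactly the same set of values, which is the defining property of quantile normalization.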

answered Sep 30 '22 08:09

Shawn. L


Tags: python, deep-learning, data-science
