Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

quantile normalization on pandas dataframe

Simply speaking, how to apply quantile normalization on a large Pandas dataframe (probably 2,000,000 rows) in Python?

PS. I know that there is a package named rpy2 which could run R in subprocess, using quantile normalize in R. But the truth is that R cannot compute the correct result when I use the data set as below:

5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05

Edit:

What I want:

Given the data shown above, how to apply quantile normalization following steps in https://en.wikipedia.org/wiki/Quantile_normalization.

I found a piece of code in Python declaring that it could compute the quantile normalization:

import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr
preprocessCore = importr('preprocessCore')


matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
v = robjects.FloatVector([ element for col in matrix for element in col ])
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array( Rnormalized_matrix)

The code works fine with the sample data used in the code, however when I test it with the data given above the result went wrong.

Since ryp2 provides an interface to run R in python subprocess, I test it again in R directly and the result was still wrong. As a result I think the reason is that the method in R is wrong.

like image 813
Shawn. L Avatar asked Jun 21 '16 05:06

Shawn. L


People also ask

How do you normalize quantile?

The quantile normalization (QN) procedure is simple (Fig. 1A): it involves first ranking the gene of each sample by magnitude, calculating the average value for genes occupying the same rank, and then substituting the values of all genes occupying that particular rank with this average value.

How do you normalize a DataFrame in Python?

To normalize all columns of the dataframe, we first subtract the column mean, and then divide by the standard deviation. Then, we range all columns of the dataframe, such that the min is 0 and the max is 1.

How do I normalize data in pandas?

Using The min-max feature scaling The min-max approach (often called normalization) rescales the feature to a hard and fast range of [0,1] by subtracting the minimum value of the feature then dividing by the range. We can apply the min-max scaling in Pandas using the . min() and . max() methods.

What does quantile normalization achieve?

Quantile normalization is a global adjustment normalization method that transforms the statistical distributions across samples to be the same and assumes global differences in the distribution are induced by technical variation (Amaratunga and Cabrera, 2001; Bolstad and others, 2003).


2 Answers

Using the example dataset from Wikipedia article:

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

df
Out: 
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8

For each rank, the mean value can be calculated with the following:

rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()

rank_mean
Out: 
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64

Then the resulting Series, rank_mean, can be used as a mapping for the ranks to get the normalized results:

df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out: 
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
like image 199
ayhan Avatar answered Sep 30 '22 08:09

ayhan


Ok I implemented the method myself of relatively high efficiency.

After finishing, this logic seems kind of easy but, anyway, I decided to post it here for any one feels confused like I was when I couldn't googled the available code.

The code is in github: Quantile Normalize

like image 44
Shawn. L Avatar answered Sep 30 '22 08:09

Shawn. L