Simply put, how do I apply quantile normalization to a large Pandas DataFrame (roughly 2,000,000 rows) in Python?
P.S. I know there is a package named rpy2 that can run R in a subprocess, so quantile normalization could be done in R. But the truth is that R does not compute the correct result when I use the data set below:
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
8.535579139044583634e-05,5.128625938538547123e-06,1.635991820040899643e-05,6.291814349531259308e-05,3.006704952043056075e-05,6.881341586355676286e-06
5.690386092696389541e-05,2.051450375415418849e-05,1.963190184049079707e-05,1.258362869906251862e-04,1.503352476021528139e-04,6.881341586355676286e-06
2.845193046348194770e-05,1.538587781561563968e-05,2.944785276073619561e-05,4.194542899687506431e-05,6.013409904086112150e-05,1.032201237953351358e-05
Edit:
What I want:
Given the data shown above, how do I apply quantile normalization following the steps in https://en.wikipedia.org/wiki/Quantile_normalization?
I found a piece of Python code claiming that it computes quantile normalization:
import rpy2.robjects as robjects
import numpy as np
from rpy2.robjects.packages import importr

# Bioconductor package that provides normalize.quantiles()
preprocessCore = importr('preprocessCore')

matrix = [ [1,2,3,4,5], [1,3,5,7,9], [2,4,6,8,10] ]
# Flatten the nested list into a single R vector, one inner list at a time
v = robjects.FloatVector([ element for row in matrix for element in row ])
# byrow=False fills the matrix column-wise, so each inner Python list
# becomes one *column* of the R matrix
m = robjects.r['matrix'](v, ncol = len(matrix), byrow=False)
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
normalized_matrix = np.array(Rnormalized_matrix)
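One thing worth verifying before concluding that the R routine is at fault is the orientation of the matrix, since preprocessCore's normalize.quantiles() treats each column as a sample. A quick sanity check, reusing m from the snippet above:

# Print the dimensions and contents of the R matrix actually passed to
# normalize.quantiles(); with ncol=3 and byrow=False it is 5x3 here,
# i.e. each inner Python list became a column.
print(robjects.r['dim'](m))
print(m)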
The code works fine with the sample data embedded in it; however, when I test it with the data given above, the result goes wrong. Since rpy2 only provides an interface to run R from Python, I tested it again directly in R, and the result was still wrong. As a result, I think the reason is that the method in R is wrong.
The quantile normalization (QN) procedure is simple: it involves first ranking the genes of each sample by magnitude, calculating the average value for genes occupying the same rank, and then substituting the values of all genes occupying that particular rank with this average value.
Quantile normalization is a global adjustment normalization method that transforms the statistical distributions across samples to be the same and assumes global differences in the distribution are induced by technical variation (Amaratunga and Cabrera, 2001; Bolstad and others, 2003).
Using the example dataset from the Wikipedia article:
import pandas as pd

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})
df
Out:
   C1  C2  C3
A   5   4   3
B   2   1   4
C   3   4   6
D   4   2   8
For each rank, the mean value can be calculated with the following:
rank_mean = df.stack().groupby(df.rank(method='first').stack().astype(int)).mean()
rank_mean
Out:
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64
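To see which cells feed each group, inspect the intermediate ranks; method='first' breaks ties by order of appearance, so every cell gets a distinct integer rank within its column:

df.rank(method='first').astype(int)
Out:
   C1  C2  C3
A   4   3   1
B   1   1   2
C   2   4   3
D   3   2   4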
Then the resulting Series, rank_mean, can be used as a mapping from rank to normalized value. Note that method='min' is used here, so tied values share the same rank and therefore map to the same average:
df.rank(method='min').stack().astype(int).map(rank_mean).unstack()
Out:
         C1        C2        C3
A  5.666667  4.666667  2.000000
B  2.000000  2.000000  3.000000
C  3.000000  4.666667  4.666667
D  4.666667  3.000000  5.666667
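Putting both steps together as a reusable helper (a sketch; quantile_normalize is just a name chosen here, not a pandas API):

import pandas as pd

def quantile_normalize(df):
    # Average the values occupying each rank across columns;
    # method='first' gives every cell a distinct rank, so ties do not
    # collapse before averaging.
    rank_mean = df.stack().groupby(
        df.rank(method='first').stack().astype(int)).mean()
    # Map each cell's tie-aware rank back to the rank average.
    return df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

Applied to the DataFrame above, quantile_normalize(df) reproduces the output shown.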
OK, I implemented the method myself with relatively high efficiency.
After finishing it, the logic seems kind of easy, but, anyway, I decided to post it here for anyone who feels as confused as I was when I couldn't google any working code.
The code is on GitHub: Quantile Normalize
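The linked repository isn't reproduced here, but an efficient implementation along the same lines can be sketched with NumPy's argsort, which scales well to millions of rows. Note this sketch resolves ties by sort order instead of averaging them, and it assumes columns are samples:

import numpy as np

def quantile_normalize_np(arr):
    # Row indices that would sort each column independently.
    order = np.argsort(arr, axis=0)
    # Mean across samples of the k-th smallest value in each column.
    rank_means = np.sort(arr, axis=0).mean(axis=1)
    out = np.empty_like(arr, dtype=float)
    # Write the rank means back into each column in its original order.
    for j in range(arr.shape[1]):
        out[order[:, j], j] = rank_means
    return out

For a large DataFrame this works directly on the underlying array, e.g. quantile_normalize_np(df.to_numpy()).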