I have a CSV that looks like this:
    gene,stem1,stem2,stem3,b1,b2,b3,special_col
    foo,20,10,11,23,22,79,3
    bar,17,13,505,12,13,88,1
    qui,17,13,5,12,13,88,3
And as a data frame it looks like this:
    In [17]: import pandas as pd
    In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
    In [21]: df
    Out[21]:
      gene  stem1  stem2  stem3  b1  b2  b3  special_col
    0  foo     20     10     11  23  22  79            3
    1  bar     17     13    505  12  13  88            1
    2  qui     17     13      5  12  13  88            3
What I want to do is compute the Pearson correlation between the last column (special_col) and every column between the gene column and special_col, i.e. colnames[1:number_of_column-1].
In the end we should have a data frame of length 6:
    Coln     PearCorr
    stem1    0.5
    stem2    -0.5
    stem3    -0.9999453506011533
    b1       0.5
    b2       0.5
    b3       -0.5
The above values were computed manually, e.g. for stem3:
    In [27]: import scipy.stats
    In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
    Out[39]: (-0.9999453506011533, 0.0066556395400007278)
How can I do that?
Initialize two variables, col1 and col2, with the names of the columns you want to correlate. Compute the correlation between them with df[col1].corr(df[col2]), save the value in a variable, corr, and print it.
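The steps above can be sketched as follows. This is a minimal example that rebuilds the question's data by hand; the choice of "stem3" as col1 is only for illustration.

```python
import pandas as pd

# The question's data, rebuilt by hand so the example is self-contained.
df = pd.DataFrame({
    "gene": ["foo", "bar", "qui"],
    "stem1": [20, 17, 17],
    "stem2": [10, 13, 13],
    "stem3": [11, 505, 5],
    "b1": [23, 12, 12],
    "b2": [22, 13, 13],
    "b3": [79, 88, 88],
    "special_col": [3, 1, 3],
})

# Names of the two columns to correlate (illustrative choice).
col1, col2 = "stem3", "special_col"

# Series.corr computes the Pearson correlation by default.
corr = df[col1].corr(df[col2])
print(corr)
```

The printed value should agree with the scipy.stats.pearsonr result shown in the question, up to floating-point rounding.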
Using the corr() function, we can get the correlation between two columns of a dataframe.
Pandas makes it very easy to find the correlation coefficient: simply call the .corr() method on the dataframe of interest. The method returns a correlation matrix that holds the correlation coefficient for every pair of columns.
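As a rough illustration of the matrix form, the sketch below reads the question's CSV from an in-memory string (an assumption made here so that no download is needed) and drops the non-numeric gene column before calling .corr(), since recent pandas versions do not silently skip string columns.

```python
import io
import pandas as pd

# The question's CSV, inlined so the example does not need network access.
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop the string column, then compute the pairwise Pearson correlations.
matrix = df.drop(columns="gene").corr()

print(matrix.shape)            # one row and one column per numeric column
print(matrix["special_col"])   # correlations of every column with special_col
```

The matrix is symmetric and its diagonal is 1, which is why the self-correlation of special_col has to be dropped when only the other columns are of interest.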
Note that there is a mistake in your data: special_col is all 3s, so no correlation can be computed.
If you remove the column selection at the end, you'll get a correlation matrix of all the other columns you are analysing. The last [:-1] removes the correlation of 'special_col' with itself.
    In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
    Out[15]:
    stem1    0.500000
    stem2   -0.500000
    stem3   -0.999945
    b1       0.500000
    b2       0.500000
    b3      -0.500000
    Name: special_col, dtype: float64
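An alternative that is not part of the answer above is DataFrame.corrwith, which correlates each selected column with a single Series and so avoids computing the full matrix. The sketch below also reshapes the result into the two-column layout the question asks for; the column names Coln and PearCorr are taken from the question.

```python
import pandas as pd

# The question's data, rebuilt by hand so the example is self-contained.
data = pd.DataFrame({
    "gene": ["foo", "bar", "qui"],
    "stem1": [20, 17, 17],
    "stem2": [10, 13, 13],
    "stem3": [11, 505, 5],
    "b1": [23, 12, 12],
    "b2": [22, 13, 13],
    "b3": [79, 88, 88],
    "special_col": [3, 1, 3],
})

# Every column strictly between 'gene' (first) and 'special_col' (last).
cols = data.columns[1:-1]

# Pearson correlation of each selected column with special_col.
corrs = data[cols].corrwith(data["special_col"])

# Turn the Series into a data frame with the requested column names.
result = corrs.rename_axis("Coln").reset_index(name="PearCorr")
print(result)
```

Because special_col is excluded from the selection up front, no trailing [:-1] is needed here.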
If you are interested in speed, this is slightly faster on my machine:
    In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
    Out[33]: array([ 0.5 , -0.5 , -0.99994535, 0.5 , 0.5 , -0.5 ])
    In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
    1000 loops, best of 3: 437 µs per loop
    In [35]: %timeit data[data.columns[1:]].corr()['special_col']
    1000 loops, best of 3: 526 µs per loop
But obviously, it returns an array, not a pandas series/DF.
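If the labels matter, one way to keep the NumPy computation and still end up with a pandas object is to wrap the array in a Series indexed by the column names. This is a sketch under the same data as the question; the variable names numeric, values and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# The question's data, rebuilt by hand so the example is self-contained.
data = pd.DataFrame({
    "gene": ["foo", "bar", "qui"],
    "stem1": [20, 17, 17],
    "stem2": [10, 13, 13],
    "stem3": [11, 505, 5],
    "b1": [23, 12, 12],
    "b2": [22, 13, 13],
    "b3": [79, 88, 88],
    "special_col": [3, 1, 3],
})

# All numeric columns, with special_col last.
numeric = data[data.columns[1:]]

# np.corrcoef treats each row as a variable, hence the transpose.
# The last row holds the correlations with special_col; [:-1] drops
# its correlation with itself.
values = np.corrcoef(numeric.to_numpy().T)[-1][:-1]

# Re-attach the column labels that the plain array lost.
result = pd.Series(values, index=numeric.columns[:-1], name="special_col")
print(result)
```

Whether this is still faster than the pure pandas version after the wrapping step would need to be timed on the actual data.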