
How to do Pearson correlation of selected columns of a Pandas data frame


I have a CSV that looks like this:

    gene,stem1,stem2,stem3,b1,b2,b3,special_col
    foo,20,10,11,23,22,79,3
    bar,17,13,505,12,13,88,1
    qui,17,13,5,12,13,88,3

As a data frame it looks like this:

    In [17]: import pandas as pd
    In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
    In [21]: df
    Out[21]:
      gene  stem1  stem2  stem3  b1  b2  b3  special_col
    0  foo     20     10     11  23  22  79            3
    1  bar     17     13    505  12  13  88            1
    2  qui     17     13      5  12  13  88            3

What I want to do is compute the Pearson correlation between the last column (special_col) and every column between the gene column and special_col, i.e. colnames[1:number_of_column-1].

In the end, we should have a data frame of length 6:

    Coln   PearCorr
    stem1   0.5
    stem2  -0.5
    stem3  -0.9999453506011533
    b1      0.5
    b2      0.5
    b3     -0.5

The stem3 value above was computed manually:

    In [27]: import scipy.stats
    In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
    Out[39]: (-0.9999453506011533, 0.0066556395400007278)

How can I do that?

neversaint asked Jan 20 '16 09:01


People also ask

How do you find the correlation between selected columns in Pandas?

Initialize two variables, col1 and col2, with the names of the columns you want to correlate. Compute the correlation with df[col1].corr(df[col2]), save it in a variable, corr, and print it.
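The steps above can be sketched as follows. This is only an illustration: the data frame is rebuilt inline from the CSV in the question rather than downloaded, and the choice of "stem3" and "special_col" as the two columns is an assumption made for the example.

```python
import pandas as pd

# Data frame rebuilt by hand from the CSV shown in the question.
df = pd.DataFrame({
    "gene": ["foo", "bar", "qui"],
    "stem1": [20, 17, 17],
    "stem2": [10, 13, 13],
    "stem3": [11, 505, 5],
    "b1": [23, 12, 12],
    "b2": [22, 13, 13],
    "b3": [79, 88, 88],
    "special_col": [3, 1, 3],
})

# Column names chosen for illustration only.
col1, col2 = "stem3", "special_col"

# Series.corr uses the Pearson method by default.
corr = df[col1].corr(df[col2])
print(corr)
```

The printed value should agree with the scipy.stats.pearsonr result quoted in the question, up to floating-point rounding.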

How do you find the correlation between columns in Python?

Using the corr() method, we can get the correlation between two columns of a data frame.

How do you calculate Pearson correlation coefficient in Pandas?

Pandas makes it easy to find the correlation coefficient: call the .corr() method on the data frame of interest. The method returns a correlation matrix holding the correlation coefficient for every pair of columns.


1 Answer

Note that there is a mistake in your data: there, special_col is all 3, so no correlation can be computed.

If you remove the column selection at the end (['special_col']), you'll get the correlation matrix of all the columns you are analysing. The final [:-1] removes the correlation of 'special_col' with itself.

    In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
    Out[15]:
    stem1    0.500000
    stem2   -0.500000
    stem3   -0.999945
    b1       0.500000
    b2       0.500000
    b3      -0.500000
    Name: special_col, dtype: float64
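As a possible alternative (not the approach used in this answer), pandas also provides DataFrame.corrwith, which correlates each column with a given Series and so avoids building the full correlation matrix. A minimal sketch, assuming the data frame is rebuilt inline from the question's CSV and named data as above:

```python
import pandas as pd

# Data frame rebuilt by hand from the CSV shown in the question.
data = pd.DataFrame({
    "gene": ["foo", "bar", "qui"],
    "stem1": [20, 17, 17],
    "stem2": [10, 13, 13],
    "stem3": [11, 505, 5],
    "b1": [23, 12, 12],
    "b2": [22, 13, 13],
    "b3": [79, 88, 88],
    "special_col": [3, 1, 3],
})

# Drop the label column and the target column, then correlate every
# remaining column with special_col (Pearson is the default method).
features = data.drop(columns=["gene", "special_col"])
result = features.corrwith(data["special_col"])
print(result)
```

The result is a Series indexed by column name, so no trailing [:-1] is needed to drop the self-correlation.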

If you are interested in speed, this is slightly faster on my machine:

    In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
    Out[33]:
    array([ 0.5       , -0.5       , -0.99994535,  0.5       ,  0.5       ,
           -0.5       ])

    In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
    1000 loops, best of 3: 437 µs per loop

    In [35]: %timeit data[data.columns[1:]].corr()['special_col']
    1000 loops, best of 3: 526 µs per loop

Note, however, that it returns a NumPy array, not a pandas Series/DataFrame.
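If the column labels are needed, the NumPy result can be wrapped back into a Series. A minimal sketch, again assuming the data frame is rebuilt inline from the question's CSV; the variable names numeric, r and labelled are made up for this example:

```python
import numpy as np
import pandas as pd

# Data frame rebuilt by hand from the CSV shown in the question.
data = pd.DataFrame({
    "gene": ["foo", "bar", "qui"],
    "stem1": [20, 17, 17],
    "stem2": [10, 13, 13],
    "stem3": [11, 505, 5],
    "b1": [23, 12, 12],
    "b2": [22, 13, 13],
    "b3": [79, 88, 88],
    "special_col": [3, 1, 3],
})

# Everything except the gene label column; special_col is last.
numeric = data[data.columns[1:]]

# np.corrcoef treats each row as a variable, hence the transpose.
# The last row holds the correlations with special_col; [:-1] drops
# the correlation of special_col with itself.
r = np.corrcoef(numeric.to_numpy().T)[-1][:-1]

# Reattach the column names to the plain array.
labelled = pd.Series(r, index=numeric.columns[:-1], name="special_col")
print(labelled)
```

Building the Series adds a little overhead, so the speed advantage over .corr() may shrink accordingly.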

Phlya answered Oct 05 '22 12:10