How to do Pearson correlation of selected columns of a Pandas data frame

I have a CSV that looks like this:

    gene,stem1,stem2,stem3,b1,b2,b3,special_col
    foo,20,10,11,23,22,79,3
    bar,17,13,505,12,13,88,1
    qui,17,13,5,12,13,88,3

And as data frame it looks like this:

    In [17]: import pandas as pd
    In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
    In [21]: df
    Out[21]:
      gene  stem1  stem2  stem3  b1  b2  b3  special_col
    0  foo     20     10     11  23  22  79            3
    1  bar     17     13    505  12  13  88            1
    2  qui     17     13      5  12  13  88            3

What I want to do is compute the Pearson correlation between the last column (special_col) and every column between the gene column and special_col, i.e. colnames[1:number_of_column-1].

The result should be a data frame with 6 rows, one per column:

    Coln   PearCorr
    stem1  0.5
    stem2 -0.5
    stem3 -0.9999453506011533
    b1     0.5
    b2     0.5
    b3    -0.5

The values above were computed manually, for example for stem3:

    In [27]: import scipy.stats
    In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
    Out[39]: (-0.9999453506011533, 0.0066556395400007278)

How can I do that?


asked Jan 20 '16 09:01

neversaint

1 Answer

~~Note there is a mistake in your data: special_col is all 3, so no correlation can be computed.~~

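In the snippet that follows, data is the data frame from the question (called df there). If you drop the final column selection (['special_col']), you get the full correlation matrix of all the columns being analysed. The trailing [:-1] removes the correlation of 'special_col' with itself.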

    In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
    Out[15]:
    stem1    0.500000
    stem2   -0.500000
    stem3   -0.999945
    b1       0.500000
    b2       0.500000
    b3      -0.500000
    Name: special_col, dtype: float64
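For anyone who wants to reproduce this without the dpaste link, here is a minimal self-contained sketch. It is not the approach from the session above: it swaps in `DataFrame.corrwith`, which correlates each column with a single Series and so avoids building the full matrix. The inline CSV copy and the names `csv_text`, `features` and `result` are assumptions made for illustration; the `Coln`/`PearCorr` headers are taken from the question.

```python
import io

import pandas as pd

# Inline copy of the CSV from the question, so the sketch runs offline.
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
df = pd.read_csv(io.StringIO(csv_text))

# Columns strictly between 'gene' and 'special_col'.
features = df.iloc[:, 1:-1]

# Pearson correlation of each feature column with special_col
# (Pearson is the default method of corrwith).
corr = features.corrwith(df["special_col"])

# Reshape into the two-column layout asked for in the question.
result = corr.rename_axis("Coln").reset_index(name="PearCorr")
print(result)
```

Because only the six feature columns are passed in, there is no self-correlation entry to slice off at the end.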

If you are interested in speed, this is slightly faster on my machine:

    In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
    Out[33]:
    array([ 0.5       , -0.5       , -0.99994535,  0.5       ,  0.5       ,
           -0.5       ])

    In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
    1000 loops, best of 3: 437 µs per loop

    In [35]: %timeit data[data.columns[1:]].corr()['special_col']
    1000 loops, best of 3: 526 µs per loop

Note, however, that it returns a NumPy array, not a pandas Series or DataFrame.
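If you want the NumPy route but still need labelled output, one option is to wrap the array in a Series yourself. The sketch below assumes the same three-row data as in the question (inlined so it runs on its own). The names `numeric`, `r` and `corr` are illustrative, and `rowvar=False` is used in place of the `.T` transpose above.

```python
import io

import numpy as np
import pandas as pd

# Inline copy of the CSV from the question (assumption: same data as the dpaste link).
csv_text = """gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3
"""
data = pd.read_csv(io.StringIO(csv_text))

# Everything except the 'gene' label column.
numeric = data[data.columns[1:]]

# rowvar=False tells corrcoef that each column (not each row) is a variable.
# The last row of the matrix holds special_col's correlations; the final
# entry of that row is special_col with itself, so it is sliced off.
r = np.corrcoef(numeric.to_numpy(), rowvar=False)[-1, :-1]

# Re-attach the column names that the plain array lost.
corr = pd.Series(r, index=numeric.columns[:-1], name="special_col")
print(corr)
```

Whether this keeps the speed advantage once the Series is built would need to be timed on your own data.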


answered Oct 05 '22 12:10

Phlya
