Correlation coefficient of two columns in pandas dataframe with .corr()

I would like to calculate the correlation coefficient between two columns of a pandas DataFrame after converting one of them to boolean. The original table had two columns: a Group column holding one of two treatment groups (now encoded as 0/1) and an Age column. Those are the two columns whose correlation coefficient I want to calculate.

I tried the .corr() method, with:

table.corr(method='pearson')

but got this back: [image: screenshot of the output, not preserved]

I have pasted the first 26 rows of the boolean table below (index 0 to 25). I don't know whether I'm missing parameters or how to interpret this result. It also seems strange that the output contains 1. Thanks in advance!

    Group  Age
0      1   50
1      1   59
2      1   22
3      1   48
4      1   53
5      1   48
6      1   29
7      1   44
8      1   28
9      1   42
10     1   35
11     0   54
12     0   43
13     1   50
14     1   62
15     0   64
16     0   39
17     1   40
18     1   59
19     1   46
20     0   56
21     1   21
22     1   45
23     0   41
24     1   46
25     0   35
florence-y asked Mar 18 '18 16:03


1 Answer

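Calling .corr() on the entire DataFrame gives you a full correlation matrix. The diagonal entries are always 1 because each column is perfectly correlated with itself; the off-diagonal entry is the Group/Age correlation you are after: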

>>> table.corr()
        Group     Age
Group  1.0000 -0.1533
Age   -0.1533  1.0000

To get a single coefficient, call .corr() on one Series and pass it the other:

>>> table['Group'].corr(table['Age'])
-0.15330486289034567

This should be faster than computing the full matrix and indexing it (with table.corr().at['Group', 'Age']; note that .iat only accepts integer positions, so label lookup needs .at or .loc). Also, this should work whether Group is bool or int dtype.
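
As a rough, self-contained sketch of the two approaches side by side: the frame below is built from rows 10 to 17 of the table in the question, so the coefficient it prints will differ from the -0.1533 reported for the full data. The variable names r_series, r_matrix and r_bool are only illustrative.

```python
import pandas as pd

# Rows 10-17 of the table in the question (a small subset, chosen only
# for illustration; it contains both Group values).
table = pd.DataFrame({
    "Group": [1, 0, 0, 1, 1, 0, 0, 1],
    "Age": [35, 54, 43, 50, 62, 64, 39, 40],
})

# Pairwise Pearson correlation between the two Series.
r_series = table["Group"].corr(table["Age"])

# The same coefficient, looked up by label in the full correlation matrix.
r_matrix = table.corr().at["Group", "Age"]

# Casting Group to bool is expected to give the same coefficient.
r_bool = table["Group"].astype(bool).corr(table["Age"])

print(r_series, r_matrix, r_bool)
```

All three values should agree up to floating-point rounding, which is one way to check that the single-number call and the matrix lookup describe the same quantity.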

Brad Solomon answered Nov 09 '22 05:11