
Is there a way to test correlation between Data X and Binary output Y?

I'm trying to find a Python method/library for testing the correlation between an independent variable X and a binary output Y.

So for example, let's say I have the following data and output:

X           Y
0.65       1
0.11       0
0.13       0
0.35       1
0.21       0
...

Let's say the output Y is 1 if X > 0.3 and 0 otherwise. If I don't know this relationship (the threshold value 0.3), is there a statistical method/test to find the degree of correlation between X and Y?

So for example, some method that returns

x = [0.65, 0.11, 0.13, 0.31, 0.21]
y = [1, 0, 0, 1, 0]
print(some_test(x, y))

==> returns "degree of correlation = 1.0"

Thanks

asked Mar 12 '15 22:03 by user2436815


People also ask

Can you do correlation with binary variable?

The Point-Biserial Correlation Coefficient is a correlation measure of the strength of association between a continuous-level variable (ratio or interval data) and a binary variable.

How do you find the correlation coefficient between X and Y?

The correlation of X and Y is the normalized covariance: Corr(X,Y) = Cov(X,Y) / σXσY . The correlation of a pair of random variables is a dimensionless number, ranging between +1 and -1.

How do you find the correlation between categorical and numerical data?

If your categorical variable is dichotomous (only two values), then you can use the point-biserial correlation. In R, the ltm package provides a function for this. Alternatively, you could fit a logistic regression and use its evaluation metrics (accuracy, etc.) in place of a correlation coefficient.

Can we find correlation between 2 categorical variables?

Tetrachoric correlation is used to calculate the correlation between binary categorical variables. Recall that binary variables are variables that can only take on one of two possible values.


1 Answer

You are looking for a point biserial correlation, which is used when one of your variables is dichotomous.

from scipy import stats

# Returns the correlation coefficient and a two-sided p-value
stats.pointbiserialr(x, y)
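As a minimal runnable sketch, here is the same call applied to the sample lists from the question. The variable names `r`, `p_value` and `r_pearson` are only illustrative, and the NumPy cross-check is an addition, not part of the original answer.

```python
import numpy as np
from scipy import stats

# Sample data copied from the question
x = np.array([0.65, 0.11, 0.13, 0.31, 0.21])
y = np.array([1, 0, 0, 1, 0])

# pointbiserialr returns the coefficient and a two-sided p-value
r, p_value = stats.pointbiserialr(x, y)
print(f"point-biserial r = {r:.3f}, p-value = {p_value:.3f}")

# Cross-check: the point-biserial coefficient is Pearson's r
# computed with the binary variable coded as 0/1
r_pearson = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r_pearson:.3f}")
```

On this sample the coefficient comes out at roughly 0.82, not 1.0. It measures linear association, so it does not reach 1.0 just because a threshold separates the two classes perfectly.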

If you simply want to know whether the mean of X differs between the Y = 0 and Y = 1 groups, you should instead use a two-sample t-test.
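A rough sketch of the t-test route, assuming `scipy.stats.ttest_ind` is the test meant here (the answer does not name a specific function). The group names `x_when_1` and `x_when_0` are made up for illustration, and five points are far too few for a meaningful result.

```python
import numpy as np
from scipy import stats

# Sample data copied from the question
x = np.array([0.65, 0.11, 0.13, 0.31, 0.21])
y = np.array([1, 0, 0, 1, 0])

# Split X into two groups according to the binary outcome
x_when_1 = x[y == 1]
x_when_0 = x[y == 0]

# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(x_when_1, x_when_0)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```

If the two groups may have unequal variances, passing `equal_var=False` switches to Welch's t-test.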

answered Oct 12 '22 06:10 by Jeff