I'm trying to find a Python method or library for testing the correlation between an independent variable X and a binary output Y.
For example, let's say I have the following data and output:
X Y
0.65 1
0.11 0
0.13 0
0.35 1
0.21 0
...
Let's say the output Y is 1 if X > 0.3 and 0 otherwise. If I don't know this relationship (the threshold value 0.3), is there a statistical method or test to find the degree of correlation between X and Y?
So for example, some method that returns
x = [0.65, 0.11, 0.13, 0.31, 0.21]
y = [1, 0, 0, 1, 0]
print(some_test(x, y))
==> returns "degree of correlation = 1.0"
Thanks
The point-biserial correlation coefficient measures the strength of association between a continuous variable (ratio or interval data) and a binary variable.
The correlation of X and Y is the normalized covariance: Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y). The correlation of a pair of random variables is a dimensionless number ranging between -1 and +1.
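To make the normalized-covariance formula concrete, here is a minimal pure-Python sketch applied to the sample lists from the question. The helper name `correlation` is made up for illustration and is not a library function.

```python
import math

def correlation(xs, ys):
    """Pearson correlation as covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population covariance and standard deviations; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys) / n)
    return cov / (std_x * std_y)

x = [0.65, 0.11, 0.13, 0.31, 0.21]
y = [1, 0, 0, 1, 0]
r = correlation(x, y)
print(r)
```

When one of the two variables is coded 0/1, this same quantity is the point-biserial correlation, so no separate formula is needed.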
If your categorical variable is dichotomous (only two values), then you can use the point-biserial correlation. In R, the ltm package has a function to do this. Alternatively, you could fit a logistic regression and use its evaluation metrics (accuracy, etc.) in place of a correlation coefficient.
Tetrachoric correlation is used to calculate the correlation between binary categorical variables. Recall that binary variables are variables that can only take on one of two possible values.
You are looking for a point biserial correlation, which is used when one of your variables is dichotomous.
from scipy import stats
stats.pointbiserialr(x,y)
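As a self-contained sketch of that call, assuming SciPy is installed and using the sample lists from the question: SciPy documents the first argument as the binary variable and the second as the continuous one, although the coefficient is symmetric, so the order does not change the result.

```python
from scipy import stats

x = [0.65, 0.11, 0.13, 0.31, 0.21]  # continuous variable
y = [1, 0, 0, 1, 0]                 # binary variable

# The result unpacks into the correlation coefficient and a two-sided p-value.
r, p_value = stats.pointbiserialr(y, x)
print(f"r = {r:.3f}, p = {p_value:.3f}")
```

On this sample the coefficient comes out below 1.0 even though the threshold rule separates the two classes perfectly, because the point-biserial coefficient measures linear association rather than separability.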
If you simply want to know whether X is different depending on the value of Y, you should instead use a t-test.
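A minimal sketch of the t-test approach, assuming SciPy is installed: split X by the value of Y and compare the two groups with `scipy.stats.ttest_ind`. The group variable names are illustrative only.

```python
from scipy import stats

x = [0.65, 0.11, 0.13, 0.31, 0.21]
y = [1, 0, 0, 1, 0]

# Split the continuous values into two groups according to the binary label.
x_when_1 = [xi for xi, yi in zip(x, y) if yi == 1]
x_when_0 = [xi for xi, yi in zip(x, y) if yi == 0]

# Independent two-sample t-test (equal variances assumed by default).
t_stat, p_value = stats.ttest_ind(x_when_1, x_when_0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With only five observations the test has very little power, so treat the p-value here as an illustration of the call rather than a meaningful result.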