Two binary variables (x and y) form two columns, one row per date, in a pandas DataFrame. I want to calculate a correlation score between x and y that quantifies how strongly x=1 is associated with y=1 (and x=0 with y=0).
What definition of correlation is appropriate?
Is there a built-in function?
| day | x | y |
|---|---|---|
| 0 | 1 | 1 | 
| 1 | 1 | 0 | 
| 2 | 0 | 0 | 
| 3 | 1 | 1 | 
Explanation: These are two categorical variables. Say x = had eggs for breakfast (0 or 1) and y = got a headache (0 or 1), and there is data from several days for both x and y. I'm trying to see how 'strongly correlated' having eggs and having a headache are. I understand that Pearson's correlation is not applicable here. What could be used?
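The correlation metric to use in this case is the phi coefficient (also known as the mean square contingency coefficient or the Matthews correlation coefficient). It is defined for two binary variables and is numerically identical to Pearson's correlation coefficient computed on the two 0/1 columns, so Pearson's correlation is in fact applicable here.
phi = (n11*n00 - n10*n01) / sqrt(n1. * n0. * n.1 * n.0)
where
n11 (n00) = number of rows with x=1 (0) and y=1 (0), and n10 (n01) = number of rows with x=1 (0) and y=0 (1);
n1. = n11 + n10 and n0. = n01 + n00 are the totals for x=1 and x=0, and n.1 = n11 + n01 and n.0 = n10 + n00 are the totals for y=1 and y=0.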
https://en.wikipedia.org/wiki/Phi_coefficient
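As for a built-in: for 0/1 columns the phi coefficient on the linked page coincides with Pearson's r, so `Series.corr` (whose default method is Pearson) should give the same number. The sketch below checks this on the four example rows from the question. The helper name `phi_coefficient` is made up for illustration, and returning NaN when a marginal total is zero is an assumption about how you want that edge case handled.

```python
import numpy as np
import pandas as pd

# Example data copied from the table in the question.
df = pd.DataFrame({"day": [0, 1, 2, 3], "x": [1, 1, 0, 1], "y": [1, 0, 0, 1]})


def phi_coefficient(x, y):
    """Phi coefficient of two binary (0/1) sequences.

    Hypothetical helper, written out from the 2x2 contingency counts.
    """
    x = pd.Series(x).astype(bool)
    y = pd.Series(y).astype(bool)

    # Cell counts of the 2x2 contingency table.
    n11 = int((x & y).sum())
    n10 = int((x & ~y).sum())
    n01 = int((~x & y).sum())
    n00 = int((~x & ~y).sum())

    # Product of the four marginal totals (x=1, x=0, y=1, y=0).
    denom = np.sqrt(float((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))
    if denom == 0:
        # Assumption: phi is treated as undefined when a variable is constant.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


phi = phi_coefficient(df["x"], df["y"])

# Built-in route: Pearson correlation on the 0/1 columns.
pearson = df["x"].corr(df["y"])

print(f"phi = {phi:.4f}, pearson = {pearson:.4f}")
```

With only four rows the value is of course not statistically meaningful; the point is only that the two routes agree.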