I have a data set made of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function
DataFrame.corr(method='pearson', min_periods=1)
only implement correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate it myself to perform a chi-square or something like it and I am not quite sure which function use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):
cat1 cat2 cat3 cat1| coef coef coef cat2| coef coef coef cat3| coef coef coef
Any ideas with pd.pivot_table or something in the same vein?
thanks in advance D.
The reason you can't run correlations on, say, one continuous and one categorical variable is because it's not possible to calculate the covariance between the two, since the categorical variable by definition cannot yield a mean, and thus cannot even enter into the first steps of the statistical analysis.
If the categorical variable has two categories (dichotomous), you can use the Pearson correlation or Spearman correlation.
You can using pd.factorize
df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1) Out[32]: a c d a 1.0 1.0 1.0 c 1.0 1.0 1.0 d 1.0 1.0 1.0
Data input
df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})
Update
from scipy.stats import chisquare df=df.apply(lambda x : pd.factorize(x)[0])+1 pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df]) Out[123]: 0 1 2 3 0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 3 0.0 0.0 0.0 0.0 df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With