
Correlation among multiple categorical variables (Pandas)


I have a data set made of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function

DataFrame.corr(method='pearson', min_periods=1) 

only implements correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate the data myself to perform a chi-square test or something like it, and I am not quite sure which function to use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):

        cat1  cat2  cat3
cat1|   coef  coef  coef
cat2|   coef  coef  coef
cat3|   coef  coef  coef

Any ideas with pd.pivot_table or something in the same vein?

Thanks in advance, D.

zar3bski, asked Dec 30 '17 15:12



1 Answer

You can use pd.factorize:

df.apply(lambda x: pd.factorize(x)[0]).corr(method='pearson', min_periods=1)

Out[32]:
     a    c    d
a  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0

Data input

import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c']})
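One thing worth checking before relying on this: pd.factorize assigns integer codes in order of first appearance, so for unordered categories the Pearson value depends on that arbitrary encoding. The sketch below uses a small made-up frame (the column names x and y and the helper factorized_corr are only for illustration) and shows the coefficient changing when the same rows are simply reversed.

```python
import pandas as pd

def factorized_corr(df):
    # Encode each column as integer codes (numbered by first appearance),
    # then compute Pearson correlation on those codes.
    return df.apply(lambda col: pd.factorize(col)[0]).corr(method='pearson')

# Hypothetical toy data, not taken from the question.
df = pd.DataFrame({'x': ['a', 'b', 'c', 'c'],
                   'y': ['p', 'q', 'p', 'p']})

r_original = factorized_corr(df).loc['x', 'y']

# Same rows in reverse order: the codes are assigned differently,
# so the coefficient is no longer the same.
r_reversed = factorized_corr(df.iloc[::-1].reset_index(drop=True)).loc['x', 'y']

print(r_original, r_reversed)
```

On this toy frame the two coefficients have opposite signs, so the result is probably best read as a property of the encoding rather than of the categories themselves.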

Update

from scipy.stats import chisquare

df = df.apply(lambda x: pd.factorize(x)[0]) + 1
pd.DataFrame([chisquare(df[x].values, f_exp=df.values.T, axis=1)[0] for x in df])

Out[123]:
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

Data input

df = pd.DataFrame({'a': ['a', 'd', 'c'], 'c': ['a', 'b', 'c'], 'd': ['a', 'b', 'c'], 'e': ['a', 'b', 'c']})

BENY, answered Oct 09 '22 04:10
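A possible alternative, which is not the method used in this answer: scipy.stats.chisquare is a goodness-of-fit test on observed versus expected frequencies, whereas a chi-square test of independence works on the contingency table of two columns. A minimal sketch, assuming Cramér's V is an acceptable coefficient for the heatmap, builds the table with pd.crosstab and runs scipy.stats.chi2_contingency on it. The helper names cramers_v and cramers_v_matrix are made up for illustration, and the sketch still loops over the column pairs (only 231 of them for 22 columns).

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    # Contingency table of the two categorical columns.
    table = pd.crosstab(x, y)
    k = min(table.shape) - 1
    if k == 0:
        return np.nan  # a column with a single category has no defined association
    # Chi-square statistic of the independence test (no Yates correction).
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * k))

def cramers_v_matrix(df):
    cols = df.columns
    # Start from the identity: each column is fully associated with itself.
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for c1, c2 in itertools.combinations(cols, 2):
        out.loc[c1, c2] = out.loc[c2, c1] = cramers_v(df[c1], df[c2])
    return out

# Same data input as in the answer's update.
df = pd.DataFrame({'a': ['a', 'd', 'c'], 'c': ['a', 'b', 'c'],
                   'd': ['a', 'b', 'c'], 'e': ['a', 'b', 'c']})
result = cramers_v_matrix(df)
print(result)
```

The returned frame has the cat-by-cat layout asked for in the question and can be passed to a plotting function such as seaborn.heatmap. Cramér's V ranges from 0 (no association) to 1 and does not depend on how the labels are encoded; with very few rows, as in this toy input, the values should be treated with caution.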