 

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good idea, or an absolutely bad one, to one-hot encode the categorical features in order to find their correlation with the labels, alongside the other continuous features?
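For example, something along these lines (a minimal sketch on seaborn's tips dataset, with an invented binary label, purely to illustrate the approach): one-hot encode the categorical columns with pandas.get_dummies, then compute the plain Pearson correlation of every resulting column against the label.

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")
label = (tips["tip"] > 3).astype(int)  # hypothetical binary label

# Numeric columns pass through unchanged; categorical ones become 0/1 dummies
features = pd.get_dummies(tips[["total_bill", "size", "day", "smoker"]], dtype=float)

# Pearson correlation of each (dummy) feature with the label
print(features.corrwith(label).sort_values())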

asked Sep 30 '17 by user8653080


People also ask

How do you calculate the correlation between categorical variables?

There are three metrics that are commonly used to calculate the correlation between categorical variables:

1. Tetrachoric Correlation: Used to calculate the correlation between binary categorical variables.
2. Polychoric Correlation: Used to calculate the correlation between ordinal categorical variables.
3. Cramér's V: Used to calculate the correlation between nominal categorical variables.
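For the tetrachoric case there is, as far as I know, no single standard Python function (R's psych::tetrachoric does maximum-likelihood estimation). A rough sketch using the classic cosine-pi approximation for a 2x2 table [[a, b], [c, d]] looks like the following; treat it as an approximation, not the maximum-likelihood estimate.

import numpy as np

def tetrachoric_approx(table):
    """Cosine-pi approximation: r ~ cos(pi / (1 + sqrt(a*d / (b*c))))."""
    a, b = table[0]
    c, d = table[1]
    return np.cos(np.pi / (1 + np.sqrt((a * d) / (b * c))))

# Hypothetical 2x2 contingency table of two binary variables
print(tetrachoric_approx(np.array([[30.0, 10.0],
                                   [15.0, 45.0]])))
# ~0.707 for this table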

What is the best way to find the association between categorical features?

According to The Search for Categorical Correlation post on Towards Data Science, one can use a variation of correlation called Cramér's V. What we need is something that looks like correlation but works with categorical values; more formally, we are looking for a measure of association between two categorical features.

How do you find the relationship between categorical and numeric variables?

To measure the relationship between a numeric variable and a categorical variable with more than two levels, you should use the eta correlation (the square root of the R² of the multifactorial regression). If the categorical variable has two levels, the point-biserial correlation is used (it is equivalent to the Pearson correlation).
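A minimal sketch of both measures on seaborn's tips dataset; correlation_ratio below is a hypothetical helper, not a library function, while the point-biserial correlation comes from scipy.stats.pointbiserialr.

import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns

tips = sns.load_dataset("tips")

def correlation_ratio(categories, values):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    df = pd.DataFrame({"cat": np.asarray(categories),
                       "val": np.asarray(values, dtype=float)})
    grand_mean = df["val"].mean()
    ss_total = ((df["val"] - grand_mean) ** 2).sum()
    groups = df.groupby("cat")["val"].agg(["count", "mean"])
    ss_between = (groups["count"] * (groups["mean"] - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

# Numeric variable vs. categorical variable with more than 2 levels (day has 4)
print(correlation_ratio(tips["day"], tips["total_bill"]))

# Numeric variable vs. binary categorical variable: point-biserial correlation
is_dinner = (tips["time"] == "Dinner").astype(int)
print(stats.pointbiserialr(is_dinner, tips["total_bill"]))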

What is the best feature selection method for categorical data?

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable: the chi-squared statistic and the mutual information statistic.
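A minimal sketch with scikit-learn; the toy DataFrame and its column names are invented for illustration. Both chi2 and mutual_info_classif expect non-negative numeric inputs, so the categorical columns are ordinal-encoded first.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Hypothetical toy data: two categorical features and a categorical target
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "green"],
    "size":   ["S", "M", "L", "S", "L", "M"],
    "label":  ["yes", "no", "yes", "no", "no", "yes"],
})

X = OrdinalEncoder().fit_transform(df[["colour", "size"]])
y = LabelEncoder().fit_transform(df["label"])

# Chi-squared statistic per feature
chi2_scores = SelectKBest(score_func=chi2, k="all").fit(X, y).scores_
print("chi2 scores:", chi2_scores)

# Mutual information statistic per feature (inputs are discrete categories)
mi_scores = mutual_info_classif(X, y, discrete_features=True)
print("mutual information scores:", mi_scores)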


2 Answers

There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V is one statistic for measuring the association between categorical variables, and it can be calculated as follows. The following link is also helpful: Using pandas, calculate Cramér's coefficient matrix. Variables with continuous values can be categorized beforehand using pandas' cut.

import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0

tips = sns.load_dataset("tips")

tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837

Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.

answered Sep 18 '22 by Keiku


I found the phik library quite useful for calculating the correlation between categorical and interval features. It is also useful for binning numerical features. Give it a try: phik documentation.
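A minimal sketch, based on my reading of the phik documentation (please verify the exact arguments against the current docs): importing phik registers a .phik_matrix() accessor on pandas DataFrames, and interval_cols tells it which columns are numeric/interval so they can be binned automatically.

import seaborn as sns
import phik  # noqa: F401  (the import registers the DataFrame accessor)

tips = sns.load_dataset("tips")

# Correlation matrix mixing categorical and interval columns
phik_corr = tips.phik_matrix(interval_cols=["total_bill", "tip", "size"])
print(phik_corr)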

answered Sep 20 '22 by Ricky