I have some categorical features in my data along with continuous ones. Is it a good or an absolutely bad idea to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features?
There are three metrics that are commonly used to calculate the correlation between categorical variables:
1. Tetrachoric Correlation: Used to calculate the correlation between binary categorical variables.
2. Polychoric Correlation: Used to calculate the correlation between ordinal categorical variables.
3. Cramér's V: Used to calculate the correlation between nominal categorical variables.
According to the post "The Search for Categorical Correlation" on Towards Data Science, one can use a variation of correlation called Cramér's V. What we need is something that looks like correlation but works with categorical values; more formally, we are looking for a measure of association between two categorical features.
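For reference, the usual (uncorrected) definition of Cramér's V can be written out as below. The symbols are assumptions chosen for this sketch: \chi^2 is the Pearson chi-squared statistic of the r-by-k contingency table of the two features and n is the total number of observations.

```latex
V = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\; r - 1)}}, \qquad 0 \le V \le 1
```

A value of 0 means no association and 1 means one feature fully determines the other. Unlike Pearson correlation, V has no sign, so it says nothing about direction.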
To measure the relationship between a numeric variable and a categorical variable with more than 2 levels, use the eta correlation, also known as the correlation ratio (the square root of the R² of a regression of the numeric variable on the categorical one). If the categorical variable has 2 levels, use the point-biserial correlation (equivalent to the Pearson correlation with the two levels coded as 0 and 1).
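Both measures are short enough to sketch directly in NumPy. This is only an illustration, not a reference implementation: the function names `correlation_ratio` and `point_biserial` and the toy arrays are made up for this example.

```python
import numpy as np


def correlation_ratio(categories, values):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        return 0.0  # a constant numeric variable has no variance to explain
    ss_between = 0.0
    for level in np.unique(categories):
        group = values[categories == level]
        ss_between += group.size * (group.mean() - grand_mean) ** 2
    return float(np.sqrt(ss_between / ss_total))


def point_biserial(binary, values):
    """Pearson correlation between a 0/1-coded variable and a numeric one."""
    binary = np.asarray(binary)
    levels = np.unique(binary)
    if levels.size != 2:
        raise ValueError("expected exactly two levels")
    coded = (binary == levels[1]).astype(float)  # second sorted level -> 1
    return float(np.corrcoef(coded, np.asarray(values, dtype=float))[0, 1])


# Made-up toy data, for illustration only.
group = np.array(["a", "a", "b", "b", "c", "c"])
score = np.array([1.0, 2.0, 4.0, 5.0, 9.0, 10.0])
eta = correlation_ratio(group, score)

smoker = np.array(["no", "no", "no", "yes", "yes", "yes"])
bill = np.array([10.0, 12.0, 11.0, 20.0, 22.0, 19.0])
r_pb = point_biserial(smoker, bill)

print(f"eta = {eta:.3f}, point-biserial r = {r_pb:.3f}")
```

Correlating a single one-hot column with a numeric variable computes exactly this point-biserial value, and for a two-level variable eta equals its absolute value. With more than two levels, eta summarizes all levels in one number instead of one correlation per dummy column.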
There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable: the chi-squared statistic and the mutual information statistic.
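As a rough sketch of what these two scores measure, both can be computed by hand from a contingency table with NumPy. The helper names and the toy features below are hypothetical and exist only for this example; in practice scikit-learn's `chi2` and `mutual_info_classif` in `sklearn.feature_selection` cover the same ground.

```python
import numpy as np


def contingency_table(x, y):
    """Count co-occurrences of the levels of two categorical arrays."""
    _, x_idx = np.unique(np.asarray(x), return_inverse=True)
    _, y_idx = np.unique(np.asarray(y), return_inverse=True)
    table = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(table, (x_idx, y_idx), 1)
    return table


def chi_squared(table):
    """Pearson chi-squared statistic (no continuity correction)."""
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())


def mutual_information(table):
    """Mutual information in nats, estimated from observed frequencies."""
    joint = table / table.sum()
    indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    nonzero = joint > 0  # empty cells contribute nothing to the sum
    return float((joint[nonzero] * np.log(joint[nonzero] / indep[nonzero])).sum())


# Made-up toy data: one feature that determines the class, one unrelated.
target = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
features = {
    "has_link": ["y", "y", "y", "y", "n", "n", "n", "n"],
    "weekday": ["mon", "tue", "mon", "tue", "mon", "tue", "mon", "tue"],
}

scores = {}
for name, column in features.items():
    table = contingency_table(column, target)
    scores[name] = (chi_squared(table), mutual_information(table))

# Rank features by chi-squared score, highest first.
ranking = sorted(scores, key=lambda name: scores[name][0], reverse=True)
print(ranking, scores)
```

Both scores are zero when a feature is independent of the target and grow with the strength of the dependence, so either can be used to rank features before keeping the top few.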
There is a way to calculate the correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for measuring the association between categorical variables, and it can be calculated as shown below. The following link is helpful: "Using pandas, calculate Cramér's coefficient matrix". Continuous variables can be included as well by first binning them into categories with pandas' cut.
```python
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0

tips = sns.load_dataset("tips")

# Bin the continuous total_bill column into 5-unit intervals so that it can
# be treated as a categorical variable.
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)


def cramers_v(confusion_matrix):
    """ calculate Cramers V statistic for categorical-categorical association.
        uses correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))


confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
```
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0. Use .values instead.
I found the phik library quite useful for calculating the correlation between categorical and interval features. It is also useful for binning numerical features. Give it a try: phik documentation