 

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good idea, or an absolutely bad one, to one-hot encode the categorical features in order to find their correlation with the labels, alongside the other continuous features?
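For example, something along these lines (a minimal sketch on seaborn's tips dataset, with an invented binary label, purely to illustrate the approach): one-hot encode the categorical columns with pandas.get_dummies, then compute the plain Pearson correlation of every resulting column against the label.

import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")
label = (tips["tip"] > 3).astype(int)  # hypothetical binary label

# Numeric columns pass through unchanged; categorical ones become 0/1 dummies
features = pd.get_dummies(tips[["total_bill", "size", "day", "smoker"]], dtype=float)

# Pearson correlation of each (dummy) feature with the label
print(features.corrwith(label).sort_values())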

asked Sep 30 '17 by user8653080


People also ask

How do you calculate the correlation between categorical variables?

There are three metrics that are commonly used to calculate the correlation between categorical variables:

1. Tetrachoric Correlation: Used to calculate the correlation between binary categorical variables.
2. Polychoric Correlation: Used to calculate the correlation between ordinal categorical variables.
3. Cramér's V: Used to calculate the correlation between nominal categorical variables.
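For the tetrachoric case there is, as far as I know, no single standard Python function (R's psych::tetrachoric does maximum-likelihood estimation). A rough sketch using the classic cosine-pi approximation for a 2x2 table [[a, b], [c, d]] looks like the following; treat it as an approximation, not the maximum-likelihood estimate.

import numpy as np

def tetrachoric_approx(table):
    """Cosine-pi approximation: r ~ cos(pi / (1 + sqrt(a*d / (b*c))))."""
    a, b = table[0]
    c, d = table[1]
    return np.cos(np.pi / (1 + np.sqrt((a * d) / (b * c))))

# Hypothetical 2x2 contingency table of two binary variables
print(tetrachoric_approx(np.array([[30.0, 10.0],
                                   [15.0, 45.0]])))
# ~0.707 for this table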

What is the best way to find the association between categorical features?

According to The Search for Categorical Correlation post on Towards Data Science, one can use a variation of correlation called Cramér's V. What we need is something that looks like correlation but works with categorical values; more formally, we are looking for a measure of association between two categorical features.

How do you find the relationship between categorical and numeric variables?

To measure the relationship between a numeric variable and a categorical variable with more than two levels, you should use the eta correlation (the square root of the R² of the multifactorial regression). If the categorical variable has two levels, the point-biserial correlation is used (it is equivalent to the Pearson correlation).
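A minimal sketch of both measures on seaborn's tips dataset; correlation_ratio below is a hypothetical helper, not a library function, while the point-biserial correlation comes from scipy.stats.pointbiserialr.

import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns

tips = sns.load_dataset("tips")

def correlation_ratio(categories, values):
    """Eta: sqrt(between-group sum of squares / total sum of squares)."""
    df = pd.DataFrame({"cat": np.asarray(categories),
                       "val": np.asarray(values, dtype=float)})
    grand_mean = df["val"].mean()
    ss_total = ((df["val"] - grand_mean) ** 2).sum()
    groups = df.groupby("cat")["val"].agg(["count", "mean"])
    ss_between = (groups["count"] * (groups["mean"] - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

# Numeric variable vs. categorical variable with more than 2 levels (day has 4)
print(correlation_ratio(tips["day"], tips["total_bill"]))

# Numeric variable vs. binary categorical variable: point-biserial correlation
is_dinner = (tips["time"] == "Dinner").astype(int)
print(stats.pointbiserialr(is_dinner, tips["total_bill"]))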

What is the best feature selection method for categorical data?

There are two popular feature selection techniques that can be used for categorical input data and a categorical (class) target variable: the chi-squared statistic and the mutual information statistic.
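A minimal sketch with scikit-learn; the toy DataFrame and its column names are invented for illustration. Both chi2 and mutual_info_classif expect non-negative numeric inputs, so the categorical columns are ordinal-encoded first.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Hypothetical toy data: two categorical features and a categorical target
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "green"],
    "size":   ["S", "M", "L", "S", "L", "M"],
    "label":  ["yes", "no", "yes", "no", "no", "yes"],
})

X = OrdinalEncoder().fit_transform(df[["colour", "size"]])
y = LabelEncoder().fit_transform(df["label"])

# Chi-squared statistic per feature
chi2_scores = SelectKBest(score_func=chi2, k="all").fit(X, y).scores_
print("chi2 scores:", chi2_scores)

# Mutual information statistic per feature (inputs are discrete categories)
mi_scores = mutual_info_classif(X, y, discrete_features=True)
print("mutual information scores:", mi_scores)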


2 Answers

There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V is one statistic for measuring the association between categorical variables, and it can be calculated as follows. The following link is also helpful: Using pandas, calculate Cramér's coefficient matrix. Variables with continuous values can be categorized beforehand using pandas' cut.

import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0

tips = sns.load_dataset("tips")

tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837

Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.

answered Sep 18 '22 by Keiku


I found the phik library quite useful for calculating the correlation between categorical and interval features. It is also useful for binning numerical features. Give it a try: phik documentation.
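A minimal sketch, based on my reading of the phik documentation (please verify the exact arguments against the current docs): importing phik registers a .phik_matrix() accessor on pandas DataFrames, and interval_cols tells it which columns are numeric/interval so they can be binned automatically.

import seaborn as sns
import phik  # noqa: F401  (the import registers the DataFrame accessor)

tips = sns.load_dataset("tips")

# Correlation matrix mixing categorical and interval columns
phik_corr = tips.phik_matrix(interval_cols=["total_bill", "tip", "size"])
print(phik_corr)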

answered Sep 20 '22 by Ricky