
How to plot a Cramer’s V heatmap for categorical features?

The association between categorical variables should be computed using Cramér's V. I found the following code to plot it, but I don't understand why the author also plotted it for "contribution", which is a numeric variable.

import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns


def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Wicher Bergsma,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))


# df is the DataFrame that holds the columns listed in cols
cols = ["Party", "Vote", "contrib"]
# start from the identity: each variable is perfectly associated with itself
corrM = np.eye(len(cols))
# there's probably a nice pandas way to do this
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]  # the matrix is symmetric

corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(7, 6))
ax = sns.heatmap(corr, annot=True, ax=ax)
ax.set_title("Cramer V Correlation between Variables")
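
For reference, the quantity that cramers_corrected_stat computes can be written out as follows, where $\chi^2$ is the chi-squared statistic of an $r \times k$ contingency table with $n$ observations. The tilde notation is chosen here for readability and is an assumption, not something taken from the code:

```latex
\begin{aligned}
\varphi^2 &= \frac{\chi^2}{n}, \\
\tilde\varphi^2 &= \max\!\left(0,\ \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right), \\
\tilde r &= r - \frac{(r-1)^2}{n-1}, \qquad \tilde k = k - \frac{(k-1)^2}{n-1}, \\
\tilde V &= \sqrt{\frac{\tilde\varphi^2}{\min(\tilde k - 1,\ \tilde r - 1)}}.
\end{aligned}
```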

I also found Bokeh. However, I am not sure whether it uses Cramér's V to plot the heatmap.

In fact, I have only two categorical features: the first has 2 categories and the second has 37. Could you please let me know how to plot a Cramér's V heatmap for them?

Some part of my dataset is here.

Thanks in advance.

asked Aug 15 '18 13:08 by ebrahimi




1 Answer

What's the problem? The code is correct.

In this case, ax is the heatmap of the association matrix between the variables. Using "contribution" is not correct, but the article you took the code from says so itself. Quote:

"This isn't right to do on the Contribution variable, but we'll do more with a model later."

The author includes this variable as an example only. In your case, what is the reason for plotting Cramér's V? As far as I can see, you have just two variables, so you will get only one Cramér's V coefficient.

But of course you can rerun the code on your own data and get a Cramér's V heatmap.
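
As a rough sketch of what that looks like with only two variables, the snippet below reuses the function from the question on made-up data. The DataFrame, the column names "group" (2 categories) and "item" (37 categories), and the way the data is generated are all hypothetical stand-ins for your dataset, so treat this as an illustration, not a definitive recipe:

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_corrected_stat(confusion_matrix):
    """Bias-corrected Cramér's V, same formula as in the question."""
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


# Hypothetical stand-in data: "group" has 2 categories, "item" has 37.
rng = np.random.default_rng(0)
n_rows = 2000
item = rng.integers(0, 37, size=n_rows)
# Make "group" depend on "item" so the association is not zero.
group = np.where(rng.random(n_rows) < 0.3 + 0.4 * (item % 2), "a", "b")
df = pd.DataFrame({"group": group, "item": item.astype(str)})

# 2 x 37 contingency table of counts
table = pd.crosstab(df["group"], df["item"])
v = cramers_corrected_stat(table)

# With two variables the matrix is 2 x 2 and has one distinct off-diagonal value.
cols = ["group", "item"]
corr = pd.DataFrame([[1.0, v], [v, 1.0]], index=cols, columns=cols)
print(corr)
```

Passing corr to sns.heatmap(corr, annot=True), as in the question, would draw the heatmap, but with two variables it only shows that single coefficient twice.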

answered Oct 15 '22 16:10 by Edward