The association between categorical variables should be computed using Cramér's V. I found the following code to plot it, but I don't understand why the author also plotted it for "contribution", which is a numeric variable?
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns


def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.

    Uses the bias correction from Bergsma,
    Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  # total number of observations
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))


# df is assumed to be a DataFrame that already contains these columns
cols = ["Party", "Vote", "contrib"]
corrM = np.zeros((len(cols), len(cols)))
# there's probably a nice pandas way to do this
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]  # the matrix is symmetric
corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(7, 6))
ax = sns.heatmap(corr, annot=True, ax=ax)
ax.set_title("Cramer V Correlation between Variables")
I also found Bokeh. However, I am not sure whether it uses Cramér's V to plot the heatmap.
In my case, I have two categorical features: the first one has 2 categories and the second one has 37 categories. Could you please let me know how to plot a Cramér's V heatmap?
Some part of my dataset is here.
Thanks in advance.
If we want to see how categorical variables interact with each other, heatmaps are a very useful way to do so. While you can use a heatmap to visualize the relationship between any two categorical variables, it's quite common to use heatmaps across dimensions of time.
The reason you can't run a correlation on, say, one continuous and one categorical variable is that the covariance between the two cannot be calculated: a categorical variable by definition has no mean, so it cannot even enter into the first steps of the calculation.
If a categorical variable has only two values (e.g. true/false), then we can convert it into a numeric datatype (0 and 1). Since it becomes a numeric variable, we can find the correlation using the DataFrame.corr() method.
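As a minimal sketch of the 0/1 conversion described above: the column names ("voted", "amount") and the values are made up for illustration and are not from the asker's dataset.

```python
import pandas as pd

# Hypothetical data: column names and values are illustrative only.
df = pd.DataFrame({
    "voted": ["yes", "no", "yes", "yes", "no", "no"],
    "amount": [10.0, 2.0, 8.0, 12.0, 1.0, 3.0],
})

# Map the two categories to 0/1 so the column becomes numeric.
df["voted_num"] = df["voted"].map({"no": 0, "yes": 1})

# Pearson correlation between a 0/1 column and a numeric column
# (this special case is known as the point-biserial correlation).
corr = df[["voted_num", "amount"]].corr()
r = corr.loc["voted_num", "amount"]
print(corr)
```

This only makes sense for a two-valued variable; a variable with many unordered categories has no meaningful 0/1 coding, which is where Cramér's V comes in.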
What's the problem? The code is absolutely right.
ax in this case is the correlation matrix between the variables, drawn as a heatmap.
Using "contribution" is not correct, and the author says so in the article below.
Quote:
"This isn't right to do on the Contribution variable, but we'll do more with a model later."
The author shows this variable as an example only. In your case, what is the reason for plotting Cramér's V? You have just two variables (as far as I can see), so you will get only one Cramér's V coefficient.
But of course you can rerun the code on your own data and get a Cramér's V heatmap.
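For the two-variable case (2 categories against 37), a rough sketch might look like the following. The data here is synthetic and the names `group`, `label`, and `cramers_v_corrected` are placeholders, not taken from the asker's dataset; the formula is the same bias-corrected one used in the question's code.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_v_corrected(table):
    """Bias-corrected Cramér's V for a contingency table (DataFrame)."""
    chi2 = ss.chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    phi2 = chi2 / n
    r, k = table.shape
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return float(np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1)))


# Synthetic stand-in data (hypothetical): 37 groups and a binary label.
n = 3700
rng = np.random.default_rng(0)
group = pd.Series(np.arange(n) % 37, name="group")              # 37 categories
label_random = pd.Series(rng.integers(0, 2, n), name="label")   # unrelated to group
label_linked = (group < 18).astype(int).rename("label")         # fully determined by group

# Each crosstab is a 2 x 37 contingency table.
table_random = pd.crosstab(label_random, group)
table_linked = pd.crosstab(label_linked, group)

# One pair of variables gives exactly one coefficient per table.
v_random = cramers_v_corrected(table_random)
v_linked = cramers_v_corrected(table_linked)
print(table_random.shape, round(v_random, 3), round(v_linked, 3))
```

With only one pair of variables the "heatmap" would be a 2 x 2 matrix with a single off-diagonal value, so plotting the crosstab itself (for example with `sns.heatmap(table_random)`) is likely to be more informative than plotting the coefficient.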