Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Plot the equivalent of correlation matrix for factors (categorical data)? And mixed types?

Actually there are 2 questions, one is more advanced than the other.

Q1: I am looking for a method that similar to corrplot() but can deal with factors.

I originally tried to use chisq.test() then calculate the p-value and Cramer's V as correlation, but there too many columns to figure out. So could anyone tell me if there is a quick way to create a "corrplot" that each cell contains the value of Cramer's V, while the colour is rendered by p-value. Or any other kind of similar plot.

Regarding Cramer's V, let's say tbl is a 2-dimensional factor data frame.

chi2 <- chisq.test(tbl, correct=F)
Cramer_V <- sqrt(chi2$/nrow(tbl)) 

I prepared a test data frame with factors:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 

Q2: Then I would like to compute a correlation/association matrix on a mixed-types dataframe e.g.:

df <- data.frame(
group = c('A', 'A', 'A', 'A', 'A', 'B', 'C'),
student = c('01', '01', '01', '02', '02', '01', '02'),
exam_pass = c('Y', 'N', 'Y', 'N', 'Y', 'Y', 'N'),
subject = c('Math', 'Science', 'Japanese', 'Math', 'Science', 'Japanese', 'Math')
) 
df$group <- factor(df$group, levels = c('A', 'B', 'C'), ordered = T)
df$student <- as.integer(df$student)
like image 894
J.D Avatar asked Sep 28 '18 11:09

J.D


People also ask

Can you do a correlation plot with categorical data?

Yes, it is possible if you also keep the variable type in a column and you pick the appropriate correlation method based on the types.

How do you find the correlation between categorical and categorical variables?

There are three big-picture methods to understand if a continuous and categorical are significantly correlated — point biserial correlation, logistic regression, and Kruskal Wallis H Test. The point biserial correlation coefficient is a special case of Pearson's correlation coefficient.

How do you plot a correlation matrix?

Plotting Correlation Matrix First, find the correlation between each variable available in the dataframe using the corr() method. The corr() method will give a matrix with the correlation values between each variable. Now, set the background gradient for the correlation data.

Can you have a correlation matrix with categorical variables?

Correlation Matrix of Categorical Variables Only. To generate the correlation matrix for only categorical variables, We are going to filter out all the categorical variables in a separate data frame.


1 Answers

If you want to have a genuine correlation plot for factors or mixed-type, you can also use model.matrix to one-hot encode all non-numeric variables. This is quite different than calculating Cramér's V as it will consider your factor as separate variables, as many regression models do.

You can then use your favorite correlation-plot library. I personally like ggcorrplot for its ggplot2 compatibility.

Here is an example with your dataset:

library(ggcorrplot)
model.matrix(~0+., data=df) %>% 
  cor(use="pairwise.complete.obs") %>% 
  ggcorrplot(show.diag = F, type="lower", lab=TRUE, lab_size=2)

enter image description here

like image 54
Dan Chaltiel Avatar answered Sep 22 '22 23:09

Dan Chaltiel