Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Correlations between numerous variables grouped in dplyr

Say I have a data frame, like this:

# Set RNG seed
set.seed(33550336)

# Create dummy data frame
df <- data.frame(PC1 = runif(20),
                 PC2 = runif(20),
                 PC3 = runif(20),
                 A = runif(20),
                 B = runif(20),
                 loc = sample(LETTERS[1:2], 20, replace = TRUE),
                 seas = sample(c("W", "S"), 20, replace = TRUE))

# > head(df)
#         PC1        PC2       PC3         A         B loc seas
# 1 0.8636470 0.02220823 0.7553348 0.4679607 0.0787467   A    S
# 2 0.3522257 0.42733152 0.2412971 0.6691419 0.1194121   A    W
# 3 0.5257408 0.44293320 0.3225228 0.0934192 0.2966507   B    S
# 4 0.0667227 0.90273594 0.6297959 0.1962124 0.4894373   A    W
# 5 0.3751383 0.50477920 0.6567203 0.4510632 0.4742191   B    S
# 6 0.9197086 0.32024904 0.8382138 0.9907894 0.9335657   A    S

I'm interested in calculating correlations between PC1, PC2, and PC3 and each of the variables A and B grouped by loc and seas. So, for example, based on this answer, I could do the following:

# Correlation of variable A and PC1 per loc & seas combination
df %>% 
  group_by(loc, seas) %>% 
  summarise(cor = cor(PC1, A)) %>% 
  ungroup

# # A tibble: 4 x 3
#   loc   seas      cor
#   <fct> <fct>   <dbl>
# 1 A     S      0.458 
# 2 A     W      0.748 
# 3 B     S     -0.0178
# 4 B     W     -0.450 

This gives me what I want: the correlation between PC1 and A for each combination of loc and seas. Awesome.

What I'm struggling with is extrapolating this to perform the calculation for each combination of PC* variables and other variables (i.e., A and B, in the example). My expected outcome is the tibble immediately above, but with a column for each combination for PC* and the other variables. I could do this long hand... cor(PC2, A), cor(PC3, A), cor(PC1, B), etc., but presumably there is there a succinct way of coding the calculation. I suspect it involves do, but I can't quite get my head around it... Can someone enlighten me?


Solution

I went with G. Grothendieck's solution below, but this required some restructuring to get it into the required format. I have posted the code I used here in case it is useful for others.

# Perform calculation
res <- by(df[1:5], df[-(1:5)], cor)

# Combinations of loc & seas 
comb <- expand.grid(dimnames(res))

#   loc seas
# 1   A    S
# 2   B    S
# 3   A    W
# 4   B    W

# A matrix corresponding to a loc & seas
# Plus the loc & seas themselves
restructure <- function(m, n){
  # Convert to data frame
  # Add rownames as column
  # Retains PCs as rows, but not columns
  # Gather variables to long format
  # Unite PC & variable names
  # Spread to a single row
  # Add combination of loc & seas
  m %>% 
    data.frame %>% 
    rownames_to_column() %>% 
    filter(grepl("PC", rownames(m))) %>% 
    select(-contains("PC")) %>% 
    gather(variable, value, -rowname) %>% 
    unite(comb, rowname, variable) %>% 
    spread(comb, value) %>% 
    bind_cols(n)
}

# Restructure each list element & combine into data frame
do.call(rbind, lapply(1:length(res), function(x)restructure(res[[x]], comb[x, ])))

which gives,

#         PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B loc seas
# 1  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403   A    S
# 2 -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262   B    S
# 3  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690   A    W
# 4 -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706   B    W
like image 710
Lyngbakr Avatar asked Dec 13 '18 15:12

Lyngbakr


People also ask

Can you do a correlation with multiple variables?

A correlation is usually tested for two variables at a time, but you can test correlations between three or more variables.

How do you find the correlation between all variables in R?

In this method to calculate the correlation between two variables, the user has to simply call the corr() function from the base R, passed with the required parameters which will be the name of the variables whose correlation is needed to be calculated and further this will be returning the correlation detail between ...

How do you compare correlations between groups?

Steps to compare Correlation Coefficient between Two GroupsFrom the menu at the top of the screen, click on Data, and then select Split File. Click on Compare Groups. Move the grouping variable (e.g. Gender) into the box labeled Groups based on. Click on OK.

What is correlation between columns pandas?

It is denoted by r and values between -1 and +1. A positive value for r indicates a positive association, and a negative value for r indicates a negative association. By using corr() function we can get the correlation between two columns in the dataframe.


2 Answers

Use by like this:

By <- by(df[1:5], df[-(1:5)], cor)

giving:

> By
loc: A
seas: S
            PC1        PC2        PC3          A           B
PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: S
            PC1         PC2         PC3           A          B
PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: A
seas: W
           PC1         PC2        PC3          A           B
PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: W
           PC1        PC2        PC3          A          B
PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000

ADDED

Based on further discussion by poster on what is wanted define the onerow function which accepts a correlation matrix or a data frame (in the latter case it converts the first 5 columns to a correlatoin matrix) producing one row of the output. The if statement in onerow is not needed, but won't hurt, for the adply line of code but we have included it so that onerow also works in a simple manner in subsequent examples below as well.

library(plyr)

onerow <- function(x) {
  if (is.data.frame(x)) x <- cor(x[1:5])
  dtab <- as.data.frame.table(x[4:5, 1:3])
  with(dtab, setNames(Freq, paste(Var2, Var1, sep = "_")))
}

adply(By, 1:2, onerow)

giving:

  loc seas       PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B
1   A    S  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403
2   B    S -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262
3   A    W  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690
4   B    W -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706

or perhaps get rid of by altogether and use this giving the same output:

library(plyr)
ddply(df, -(1:5), onerow)

or using dplyr:

library(dplyr)
df %>%
  group_by_at(-(1:5)) %>%
  do( onerow(.) %>% t %>% as.data.frame ) %>%
  ungroup
like image 142
G. Grothendieck Avatar answered Sep 27 '22 23:09

G. Grothendieck


We can do a split and cor in base R

lapply(split(df[1:5], df[-(1:5)]), cor)
#$A.S
#            PC1        PC2        PC3          A           B
#PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
#PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
#PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
#A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
#B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000

#$B.S
#            PC1         PC2         PC3           A          B
#PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
#PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
#PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
#A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
#B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000

#$A.W
#           PC1         PC2        PC3          A           B
#PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
#PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
#PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
#A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
#B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000

#$B.W
#           PC1        PC2        PC3          A          B
#PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
#PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
#PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
#A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
#B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000

Or using tidyverse

library(tidyverse)
df %>% 
    group_by_at(6:7) %>% 
    nest %>% 
    mutate(data = map(data, cor)) 
like image 24
akrun Avatar answered Sep 27 '22 23:09

akrun