Correlations between numerous variables grouped in dplyr

Tags: r, dplyr, correlation

Say I have a data frame, like this:

# Set RNG seed
set.seed(33550336)

# Create dummy data frame
df <- data.frame(PC1 = runif(20),
                 PC2 = runif(20),
                 PC3 = runif(20),
                 A = runif(20),
                 B = runif(20),
                 loc = sample(LETTERS[1:2], 20, replace = TRUE),
                 seas = sample(c("W", "S"), 20, replace = TRUE))

# > head(df)
#         PC1        PC2       PC3         A         B loc seas
# 1 0.8636470 0.02220823 0.7553348 0.4679607 0.0787467   A    S
# 2 0.3522257 0.42733152 0.2412971 0.6691419 0.1194121   A    W
# 3 0.5257408 0.44293320 0.3225228 0.0934192 0.2966507   B    S
# 4 0.0667227 0.90273594 0.6297959 0.1962124 0.4894373   A    W
# 5 0.3751383 0.50477920 0.6567203 0.4510632 0.4742191   B    S
# 6 0.9197086 0.32024904 0.8382138 0.9907894 0.9335657   A    S

I'm interested in calculating correlations between PC1, PC2, and PC3 and each of the variables A and B grouped by loc and seas. So, for example, based on this answer, I could do the following:

# Correlation of variable A and PC1 per loc & seas combination
df %>% 
  group_by(loc, seas) %>% 
  summarise(cor = cor(PC1, A)) %>% 
  ungroup

# # A tibble: 4 x 3
#   loc   seas      cor
#   <fct> <fct>   <dbl>
# 1 A     S      0.458 
# 2 A     W      0.748 
# 3 B     S     -0.0178
# 4 B     W     -0.450

This gives me what I want: the correlation between PC1 and A for each combination of loc and seas. Awesome.

What I'm struggling with is extending this to perform the calculation for each combination of PC* variables and the other variables (i.e., A and B in the example). My expected outcome is the tibble immediately above, but with a column for each combination of a PC* variable and one of the other variables. I could do this longhand (cor(PC2, A), cor(PC3, A), cor(PC1, B), etc.), but presumably there is a succinct way of coding the calculation. I suspect it involves do, but I can't quite get my head around it. Can someone enlighten me?

Solution

I went with G. Grothendieck's solution below, but this required some restructuring to get it into the required format. I have posted the code I used here in case it is useful for others.

# Packages used by restructure() below
library(dplyr)
library(tibble)
library(tidyr)

# Perform calculation
res <- by(df[1:5], df[-(1:5)], cor)

# Combinations of loc & seas 
comb <- expand.grid(dimnames(res))

#   loc seas
# 1   A    S
# 2   B    S
# 3   A    W
# 4   B    W

# m: correlation matrix for one loc & seas combination
# n: one-row data frame holding that loc & seas combination
restructure <- function(m, n){
  m %>%
    # Convert to data frame
    data.frame %>%
    # Add rownames as column
    rownames_to_column() %>%
    # Retain PCs as rows, but not columns
    filter(grepl("PC", rowname)) %>%
    select(-contains("PC")) %>%
    # Gather variables to long format
    gather(variable, value, -rowname) %>%
    # Unite PC & variable names
    unite(comb, rowname, variable) %>%
    # Spread to a single row
    spread(comb, value) %>%
    # Add combination of loc & seas
    bind_cols(n)
}

# Restructure each list element & combine into data frame
do.call(rbind, lapply(seq_along(res), function(x) restructure(res[[x]], comb[x, ])))

which gives,

#         PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B loc seas
# 1  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403   A    S
# 2 -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262   B    S
# 3  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690   A    W
# 4 -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706   B    W

asked Dec 13 '18 15:12

Lyngbakr

2 Answers

Use by like this:

By <- by(df[1:5], df[-(1:5)], cor)

giving:

> By
loc: A
seas: S
            PC1        PC2        PC3          A           B
PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: S
            PC1         PC2         PC3           A          B
PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: A
seas: W
           PC1         PC2        PC3          A           B
PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000
----------------------------------------------------------------------------------------------------------------------------- 
loc: B
seas: W
           PC1        PC2        PC3          A          B
PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000

ADDED

Based on further discussion with the poster about what is wanted, define the onerow function, which accepts a correlation matrix or a data frame (in the latter case it converts the first 5 columns to a correlation matrix) and produces one row of the output. The if statement in onerow is not needed for the adply line of code, though it does no harm; we have included it so that onerow also works in a simple manner in the subsequent examples below.

library(plyr)

onerow <- function(x) {
  if (is.data.frame(x)) x <- cor(x[1:5])
  dtab <- as.data.frame.table(x[4:5, 1:3])
  with(dtab, setNames(Freq, paste(Var2, Var1, sep = "_")))
}

adply(By, 1:2, onerow)

giving:

  loc seas       PC1_A       PC1_B      PC2_A       PC2_B      PC3_A     PC3_B
1   A    S  0.45763159 -0.00925106  0.3522161  0.20916667 -0.2003091 0.3741403
2   B    S -0.01779813 -0.74328144 -0.3501188  0.46324158  0.8034240 0.4580262
3   A    W  0.74835455  0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690
4   B    W -0.45025181 -0.66721038 -0.9899521 -0.80989058  0.7606430 0.3738706
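
To see where those column names come from, it can help to run the two key steps of onerow on a tiny matrix. The sketch below is illustrative only: the matrix m and its integer values are made up, standing in for the x[4:5, 1:3] block of a real correlation matrix, and one_row is a hypothetical name.

```r
# Hypothetical 2 x 3 block: rows are the variables, columns are the PCs
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("A", "B"), c("PC1", "PC2", "PC3")))

# as.data.frame.table() unrolls the matrix column by column into long
# format: Var1 = row name, Var2 = column name, Freq = cell value
dtab <- as.data.frame.table(m)

# Paste the column and row names together and use them to name the values
one_row <- with(dtab, setNames(Freq, paste(Var2, Var1, sep = "_")))
one_row
# PC1_A PC1_B PC2_A PC2_B PC3_A PC3_B 
#     1     2     3     4     5     6 
```

Because the matrix is unrolled column by column, both variables for PC1 come first, then both for PC2, and so on, which matches the column order in the output above.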

Or perhaps get rid of by altogether and use this, which gives the same output:

library(plyr)
ddply(df, -(1:5), onerow)

or using dplyr:

library(dplyr)
df %>%
  group_by_at(-(1:5)) %>%
  do( onerow(.) %>% t %>% as.data.frame ) %>%
  ungroup
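
With dplyr 1.0.0 or later, a similar wide table can likely be produced without do or plyr by combining summarise with across. This is a sketch and not part of the original answer. The stand-in data frame below (seed 1, balanced loc and seas) is an assumption made so the block runs on its own, so its numbers will not match the output shown above; res is a hypothetical name.

```r
library(dplyr)

# Stand-in data with the same column layout as the question's df.
# Assumption: loc and seas are assigned in a balanced way so that
# each of the four groups has exactly five rows.
set.seed(1)
df <- data.frame(PC1 = runif(20), PC2 = runif(20), PC3 = runif(20),
                 A = runif(20), B = runif(20),
                 loc = rep(c("A", "B"), each = 10),
                 seas = rep(c("S", "W"), times = 10))

# For each PC column, compute its correlation with A and with B.
# across() names the results "{column}_{function}", e.g. PC1_A.
res <- df %>%
  group_by(loc, seas) %>%
  summarise(across(PC1:PC3,
                   list(A = ~ cor(.x, A), B = ~ cor(.x, B))),
            .groups = "drop")

res
```

Inside each lambda, .x is the current PC column and A or B is the matching column of the current group, so every value is a per-group correlation. Adding another variable would mean adding another entry to the list.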

answered Sep 27 '22 23:09

G. Grothendieck

We can split the data frame by group and apply cor to each piece in base R:

lapply(split(df[1:5], df[-(1:5)]), cor)
#$A.S
#            PC1        PC2        PC3          A           B
#PC1  1.00000000 -0.3941583  0.1872622  0.4576316 -0.00925106
#PC2 -0.39415826  1.0000000 -0.6797708  0.3522161  0.20916667
#PC3  0.18726218 -0.6797708  1.0000000 -0.2003091  0.37414025
#A    0.45763159  0.3522161 -0.2003091  1.0000000  0.57292305
#B   -0.00925106  0.2091667  0.3741403  0.5729230  1.00000000

#$B.S
#            PC1         PC2         PC3           A          B
#PC1  1.00000000 -0.52651449  0.07120701 -0.01779813 -0.7432814
#PC2 -0.52651449  1.00000000 -0.05448583 -0.35011878  0.4632416
#PC3  0.07120701 -0.05448583  1.00000000  0.80342399  0.4580262
#A   -0.01779813 -0.35011878  0.80342399  1.00000000  0.5558740
#B   -0.74328144  0.46324158  0.45802622  0.55587404  1.0000000

#$A.W
#           PC1         PC2        PC3          A           B
#PC1  1.0000000 -0.79784422  0.0932317  0.7483545  0.49639477
#PC2 -0.7978442  1.00000000 -0.3526315 -0.3994917 -0.05233889
#PC3  0.0932317 -0.35263151  1.0000000 -0.5902400  0.36066898
#A    0.7483545 -0.39949171 -0.5902400  1.0000000  0.18081316
#B    0.4963948 -0.05233889  0.3606690  0.1808132  1.00000000

#$B.W
#           PC1        PC2        PC3          A          B
#PC1  1.0000000  0.3441459  0.1135686 -0.4502518 -0.6672104
#PC2  0.3441459  1.0000000 -0.8447551 -0.9899521 -0.8098906
#PC3  0.1135686 -0.8447551  1.0000000  0.7606430  0.3738706
#A   -0.4502518 -0.9899521  0.7606430  1.0000000  0.8832408
#B   -0.6672104 -0.8098906  0.3738706  0.8832408  1.0000000
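
If the one-row-per-group layout from the question is wanted, one possible base R follow-up is to pull the A/B rows and PC columns out of each matrix and stack the results. This is only a sketch: the names pcs, vars, mats and wide are illustrative, and the stand-in data frame (seed 1, balanced groups) is an assumption so the block is self-contained.

```r
# Stand-in data with the same column layout as the question's df
# (assumption: balanced loc/seas so every group has five rows)
set.seed(1)
df <- data.frame(PC1 = runif(20), PC2 = runif(20), PC3 = runif(20),
                 A = runif(20), B = runif(20),
                 loc = rep(c("A", "B"), each = 10),
                 seas = rep(c("S", "W"), times = 10))

# One correlation matrix per loc/seas combination, as above
mats <- lapply(split(df[1:5], df[-(1:5)]), cor)

pcs  <- c("PC1", "PC2", "PC3")
vars <- c("A", "B")

# For each matrix, keep the variables-by-PCs block, flatten it column
# by column, and label each value as "<PC>_<variable>"
wide <- t(sapply(mats, function(m) {
  block  <- m[vars, pcs]
  labels <- outer(vars, pcs, function(v, pc) paste(pc, v, sep = "_"))
  setNames(as.vector(block), as.vector(labels))
}))

wide
```

The result is a numeric matrix with one row per group (row names such as A.S and B.W come from split) and one column per PC and variable pair.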

Or using tidyverse

library(tidyverse)
df %>% 
    group_by_at(6:7) %>% 
    nest %>% 
    mutate(data = map(data, cor))

answered Sep 27 '22 23:09

akrun
