Say I have a data frame, like this:
# Set RNG seed
set.seed(33550336)
# Create dummy data frame
df <- data.frame(PC1 = runif(20),
PC2 = runif(20),
PC3 = runif(20),
A = runif(20),
B = runif(20),
loc = sample(LETTERS[1:2], 20, replace = TRUE),
seas = sample(c("W", "S"), 20, replace = TRUE))
# > head(df)
# PC1 PC2 PC3 A B loc seas
# 1 0.8636470 0.02220823 0.7553348 0.4679607 0.0787467 A S
# 2 0.3522257 0.42733152 0.2412971 0.6691419 0.1194121 A W
# 3 0.5257408 0.44293320 0.3225228 0.0934192 0.2966507 B S
# 4 0.0667227 0.90273594 0.6297959 0.1962124 0.4894373 A W
# 5 0.3751383 0.50477920 0.6567203 0.4510632 0.4742191 B S
# 6 0.9197086 0.32024904 0.8382138 0.9907894 0.9335657 A S
I'm interested in calculating correlations between PC1
, PC2
, and PC3
and each of the variables A
and B
grouped by loc
and seas
. So, for example, based on this answer, I could do the following:
# Correlation of variable A and PC1 per loc & seas combination
df %>%
group_by(loc, seas) %>%
summarise(cor = cor(PC1, A)) %>%
ungroup
# # A tibble: 4 x 3
# loc seas cor
# <fct> <fct> <dbl>
# 1 A S 0.458
# 2 A W 0.748
# 3 B S -0.0178
# 4 B W -0.450
This gives me what I want: the correlation between PC1
and A
for each combination of loc
and seas
. Awesome.
What I'm struggling with is extrapolating this to perform the calculation for each combination of PC*
variables and other variables (i.e., A
and B
, in the example). My expected outcome is the tibble immediately above, but with a column for each combination for PC*
and the other variables. I could do this long hand... cor(PC2, A)
, cor(PC3, A)
, cor(PC1, B)
, etc., but presumably there is there a succinct way of coding the calculation. I suspect it involves do
, but I can't quite get my head around it... Can someone enlighten me?
I went with G. Grothendieck's solution below, but this required some restructuring to get it into the required format. I have posted the code I used here in case it is useful for others.
# Perform calculation
res <- by(df[1:5], df[-(1:5)], cor)
# Combinations of loc & seas
comb <- expand.grid(dimnames(res))
# loc seas
# 1 A S
# 2 B S
# 3 A W
# 4 B W
# A matrix corresponding to a loc & seas
# Plus the loc & seas themselves
restructure <- function(m, n){
# Convert to data frame
# Add rownames as column
# Retains PCs as rows, but not columns
# Gather variables to long format
# Unite PC & variable names
# Spread to a single row
# Add combination of loc & seas
m %>%
data.frame %>%
rownames_to_column() %>%
filter(grepl("PC", rownames(m))) %>%
select(-contains("PC")) %>%
gather(variable, value, -rowname) %>%
unite(comb, rowname, variable) %>%
spread(comb, value) %>%
bind_cols(n)
}
# Restructure each list element & combine into data frame
do.call(rbind, lapply(1:length(res), function(x)restructure(res[[x]], comb[x, ])))
which gives,
# PC1_A PC1_B PC2_A PC2_B PC3_A PC3_B loc seas
# 1 0.45763159 -0.00925106 0.3522161 0.20916667 -0.2003091 0.3741403 A S
# 2 -0.01779813 -0.74328144 -0.3501188 0.46324158 0.8034240 0.4580262 B S
# 3 0.74835455 0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690 A W
# 4 -0.45025181 -0.66721038 -0.9899521 -0.80989058 0.7606430 0.3738706 B W
A correlation is usually tested for two variables at a time, but you can test correlations between three or more variables.
In this method to calculate the correlation between two variables, the user has to simply call the corr() function from the base R, passed with the required parameters which will be the name of the variables whose correlation is needed to be calculated and further this will be returning the correlation detail between ...
Steps to compare Correlation Coefficient between Two GroupsFrom the menu at the top of the screen, click on Data, and then select Split File. Click on Compare Groups. Move the grouping variable (e.g. Gender) into the box labeled Groups based on. Click on OK.
It is denoted by r and values between -1 and +1. A positive value for r indicates a positive association, and a negative value for r indicates a negative association. By using corr() function we can get the correlation between two columns in the dataframe.
Use by
like this:
By <- by(df[1:5], df[-(1:5)], cor)
giving:
> By
loc: A
seas: S
PC1 PC2 PC3 A B
PC1 1.00000000 -0.3941583 0.1872622 0.4576316 -0.00925106
PC2 -0.39415826 1.0000000 -0.6797708 0.3522161 0.20916667
PC3 0.18726218 -0.6797708 1.0000000 -0.2003091 0.37414025
A 0.45763159 0.3522161 -0.2003091 1.0000000 0.57292305
B -0.00925106 0.2091667 0.3741403 0.5729230 1.00000000
-----------------------------------------------------------------------------------------------------------------------------
loc: B
seas: S
PC1 PC2 PC3 A B
PC1 1.00000000 -0.52651449 0.07120701 -0.01779813 -0.7432814
PC2 -0.52651449 1.00000000 -0.05448583 -0.35011878 0.4632416
PC3 0.07120701 -0.05448583 1.00000000 0.80342399 0.4580262
A -0.01779813 -0.35011878 0.80342399 1.00000000 0.5558740
B -0.74328144 0.46324158 0.45802622 0.55587404 1.0000000
-----------------------------------------------------------------------------------------------------------------------------
loc: A
seas: W
PC1 PC2 PC3 A B
PC1 1.0000000 -0.79784422 0.0932317 0.7483545 0.49639477
PC2 -0.7978442 1.00000000 -0.3526315 -0.3994917 -0.05233889
PC3 0.0932317 -0.35263151 1.0000000 -0.5902400 0.36066898
A 0.7483545 -0.39949171 -0.5902400 1.0000000 0.18081316
B 0.4963948 -0.05233889 0.3606690 0.1808132 1.00000000
-----------------------------------------------------------------------------------------------------------------------------
loc: B
seas: W
PC1 PC2 PC3 A B
PC1 1.0000000 0.3441459 0.1135686 -0.4502518 -0.6672104
PC2 0.3441459 1.0000000 -0.8447551 -0.9899521 -0.8098906
PC3 0.1135686 -0.8447551 1.0000000 0.7606430 0.3738706
A -0.4502518 -0.9899521 0.7606430 1.0000000 0.8832408
B -0.6672104 -0.8098906 0.3738706 0.8832408 1.0000000
Based on further discussion by poster on what is wanted define the onerow
function which accepts a correlation matrix or a data frame (in the latter case it converts the first 5 columns to a correlatoin matrix) producing one row of the output. The if
statement in onerow
is not needed, but won't hurt, for the adply
line of code but we have included it so that onerow
also works in a simple manner in subsequent examples below as well.
library(plyr)
onerow <- function(x) {
if (is.data.frame(x)) x <- cor(x[1:5])
dtab <- as.data.frame.table(x[4:5, 1:3])
with(dtab, setNames(Freq, paste(Var2, Var1, sep = "_")))
}
adply(By, 1:2, onerow)
giving:
loc seas PC1_A PC1_B PC2_A PC2_B PC3_A PC3_B
1 A S 0.45763159 -0.00925106 0.3522161 0.20916667 -0.2003091 0.3741403
2 B S -0.01779813 -0.74328144 -0.3501188 0.46324158 0.8034240 0.4580262
3 A W 0.74835455 0.49639477 -0.3994917 -0.05233889 -0.5902400 0.3606690
4 B W -0.45025181 -0.66721038 -0.9899521 -0.80989058 0.7606430 0.3738706
or perhaps get rid of by
altogether and use this giving the same output:
library(plyr)
ddply(df, -(1:5), onerow)
or using dplyr:
library(dplyr)
df %>%
group_by_at(-(1:5)) %>%
do( onerow(.) %>% t %>% as.data.frame ) %>%
ungroup
We can do a split
and cor
in base R
lapply(split(df[1:5], df[-(1:5)]), cor)
#$A.S
# PC1 PC2 PC3 A B
#PC1 1.00000000 -0.3941583 0.1872622 0.4576316 -0.00925106
#PC2 -0.39415826 1.0000000 -0.6797708 0.3522161 0.20916667
#PC3 0.18726218 -0.6797708 1.0000000 -0.2003091 0.37414025
#A 0.45763159 0.3522161 -0.2003091 1.0000000 0.57292305
#B -0.00925106 0.2091667 0.3741403 0.5729230 1.00000000
#$B.S
# PC1 PC2 PC3 A B
#PC1 1.00000000 -0.52651449 0.07120701 -0.01779813 -0.7432814
#PC2 -0.52651449 1.00000000 -0.05448583 -0.35011878 0.4632416
#PC3 0.07120701 -0.05448583 1.00000000 0.80342399 0.4580262
#A -0.01779813 -0.35011878 0.80342399 1.00000000 0.5558740
#B -0.74328144 0.46324158 0.45802622 0.55587404 1.0000000
#$A.W
# PC1 PC2 PC3 A B
#PC1 1.0000000 -0.79784422 0.0932317 0.7483545 0.49639477
#PC2 -0.7978442 1.00000000 -0.3526315 -0.3994917 -0.05233889
#PC3 0.0932317 -0.35263151 1.0000000 -0.5902400 0.36066898
#A 0.7483545 -0.39949171 -0.5902400 1.0000000 0.18081316
#B 0.4963948 -0.05233889 0.3606690 0.1808132 1.00000000
#$B.W
# PC1 PC2 PC3 A B
#PC1 1.0000000 0.3441459 0.1135686 -0.4502518 -0.6672104
#PC2 0.3441459 1.0000000 -0.8447551 -0.9899521 -0.8098906
#PC3 0.1135686 -0.8447551 1.0000000 0.7606430 0.3738706
#A -0.4502518 -0.9899521 0.7606430 1.0000000 0.8832408
#B -0.6672104 -0.8098906 0.3738706 0.8832408 1.0000000
Or using tidyverse
library(tidyverse)
df %>%
group_by_at(6:7) %>%
nest %>%
mutate(data = map(data, cor))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With