Spearman correlation by group in R

Tags:

r

How do you calculate Spearman correlation by group in R? I found the following link, which discusses Pearson correlation by group, but when I tried to change the correlation type to Spearman, it did not work.

https://stats.stackexchange.com/questions/4040/r-compute-correlation-by-group

user1009166 asked Jan 09 '12 16:01


People also ask

How do you calculate Spearman correlation in R?

To calculate Spearman's ρ in R by hand, first rank the x and y variables and store the ranks in a new data frame. Then take the covariance of the ranked variables and divide it by the product of their standard deviations. The built-in shortcut is cor(x, y, method = "spearman").
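
The steps above can be sketched as follows. This is a minimal illustration on made-up vectors x and y (hypothetical data, not taken from the question); it only shows that the by-hand calculation and the built-in method agree.

```r
# Hypothetical example data (not from the question)
set.seed(1)
x <- rnorm(10)
y <- rnorm(10)

# Step 1: rank each variable and keep the ranks in a new data frame
ranked <- data.frame(rx = rank(x), ry = rank(y))

# Step 2: covariance of the ranks divided by the product of their standard deviations
rho_manual <- cov(ranked$rx, ranked$ry) / (sd(ranked$rx) * sd(ranked$ry))

# Built-in equivalent
rho_builtin <- cor(x, y, method = "spearman")

# Compare the two values up to floating-point tolerance
all.equal(rho_manual, rho_builtin)
```

In other words, Spearman's ρ is simply the Pearson correlation computed on the ranks.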

Can Spearman correlation be used for categorical variables?

The correlation between two numeric variables can be measured with the Spearman coefficient. To measure the relationship between a numeric variable and a categorical variable with more than two levels, use the eta correlation (the square root of the R² of the regression of the numeric variable on the categorical one).
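
As a rough sketch of the eta calculation, assuming a made-up numeric outcome y and a three-level factor g (both hypothetical), the R² can be taken from lm():

```r
# Hypothetical data: a numeric outcome and a three-level factor
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 10))
y <- rnorm(30, mean = as.integer(g))

# R-squared of the regression of y on the factor
fit <- lm(y ~ g)
r2 <- summary(fit)$r.squared

# eta is the square root of that R-squared
eta <- sqrt(r2)
eta
```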

What is the difference between Spearman rho and Pearson r?

The fundamental difference between the two correlation coefficients is that the Pearson coefficient measures the strength of a linear relationship between the two variables, whereas the Spearman coefficient measures the strength of any monotonic relationship, linear or not.
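
A small illustration of that difference, using a made-up pair of variables where y is a strictly increasing but strongly nonlinear function of x:

```r
# Hypothetical data: y increases with x, but not linearly
x <- 1:10
y <- exp(x)

# Pearson is pulled below 1 because the relationship is not linear
r_pearson <- cor(x, y, method = "pearson")

# Spearman only looks at the ranks, which agree perfectly here
r_spearman <- cor(x, y, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

Because the ranks of x and y are identical, the Spearman coefficient reaches its maximum while the Pearson coefficient does not.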


4 Answers

How about this for a base R solution:

df <- data.frame(group = rep(c("G1", "G2"), each = 10),
                 var1 = rnorm(20),
                 var2 = rnorm(20))

r <- by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))
r
# df$group: G1
# [1] 0.4060606
# ------------------------------------------------------------ 
# df$group: G2
# [1] 0.1272727

And then, if you want the results in the form of a data.frame:

data.frame(group = dimnames(r)[[1]], corr = as.vector(r))
#   group      corr
# 1    G1 0.4060606
# 2    G2 0.1272727

EDIT: If you prefer a plyr-based solution, here is one:

library(plyr)
ddply(df, .(group), summarise, "corr" = cor(var1, var2, method = "spearman"))
Josh O'Brien answered Oct 02 '22 22:10


This is a very old question, but the tidyverse and broom solution is so straightforward that I have to share it:

set.seed(123)
df <- data.frame(group = rep(c("G1", "G2"), each = 10),
                 var1 = rnorm(20),
                 var2 = rnorm(20))

library(tidyverse)
library(broom)

df %>% 
  group_by(group) %>%
  summarize(correlation = cor(var1, var2, method = "spearman"))
# A tibble: 2 x 2
#   group correlation
#   <fct>       <dbl>
# 1 G1        -0.200 
# 2 G2         0.0545

# with pvalues and further stats
df %>% 
  nest(-group) %>% 
  mutate(cor=map(data,~cor.test(.x$var1, .x$var2, method = "sp"))) %>%
  mutate(tidied = map(cor, tidy)) %>% 
  unnest(tidied, .drop = T)
# A tibble: 2 x 6
#   group estimate statistic p.value method                          alternative
#   <fct>    <dbl>     <dbl>   <dbl> <chr>                           <chr>      
# 1 G1     -0.200        198   0.584 Spearman's rank correlation rho two.sided  
# 2 G2      0.0545       156   0.892 Spearman's rank correlation rho two.sided 

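In newer tidyr versions (the nest()/unnest() interface changed around tidyr 1.0.0), nest() expects a named argument and the .drop argument of unnest() is deprecated, so write it like this to get the same results without errors: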

df %>% 
  nest(data = -group) %>%
  mutate(cor=map(data,~cor.test(.x$var1, .x$var2, method = "sp"))) %>%
  mutate(tidied = map(cor, tidy)) %>% 
  unnest(tidied) %>% 
  select(-data, -cor)
Roman answered Oct 02 '22 23:10


Here's another way to do it:

# split the data by group then apply spearman correlation
# to each element of that list
j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})

# Bring it together
data.frame(group = names(j), corr = unlist(j), row.names = NULL)
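
If p-values are also needed, the same split/lapply pattern can be extended with cor.test(). This is only a sketch in base R, assuming the same kind of simulated df with columns group, var1 and var2; the names res and out are just illustrative.

```r
# Same kind of simulated data as in the answers above
set.seed(123)
df <- data.frame(group = rep(c("G1", "G2"), each = 10),
                 var1 = rnorm(20),
                 var2 = rnorm(20))

# Run cor.test() on each group and keep the estimate and the p-value
res <- lapply(split(df, df$group), function(x) {
  ct <- cor.test(x$var1, x$var2, method = "spearman")
  data.frame(rho = unname(ct$estimate), p.value = ct$p.value)
})

# Bind the one-row data frames together and add the group names
out <- do.call(rbind, res)
out <- data.frame(group = rownames(out), out, row.names = NULL)
out
```

Note that cor.test() warns that it cannot compute exact p-values when a group contains ties; that does not happen with continuous simulated data like this.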

Comparing my method, Josh's method, and the plyr solution using rbenchmark:

Dason <- function(){
    # split the data by group then apply spearman correlation
    # to each element of that list
    j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})

    # Bring it together
    data.frame(group = names(j), corr = unlist(j), row.names = NULL)
}

Josh <- function(){
    r <- by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))
    data.frame(group = attributes(r)$dimnames[[1]], corr = as.vector(r))
}

plyr <- function(){
    ddply(df, .(group), summarise, "corr" = cor(var1, var2, method = "spearman"))
}


library(rbenchmark)
benchmark(Dason(), Josh(), plyr())

Which gives the output

> benchmark(Dason(), Josh(), plyr())
     test replications elapsed relative user.self sys.self user.child sys.child
1 Dason()          100    0.19 1.000000      0.19        0         NA        NA
2  Josh()          100    0.24 1.263158      0.22        0         NA        NA
3  plyr()          100    0.51 2.684211      0.52        0         NA        NA

So my method is slightly faster than Josh's (his takes about 1.26 times as long), but not by much, and I think Josh's method is a little more intuitive. The plyr solution is the easiest to code up and the most convenient, but it is the slowest of the three here, at roughly 2.7 times the elapsed time of mine.

Dason answered Oct 02 '22 21:10


If you want an efficient solution for large numbers of groups then data.table is the way to go.

library(data.table)
DT <- as.data.table(df)
setkey(DT, group)
DT[,list(corr = cor(var1,var2,method = 'spearman')), by = group]
mnel answered Oct 02 '22 22:10