Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Correlation between groups in R data.table

Is there a way of elegantly calculating the correlations between values if those values are stored by group in a single column of a data.table (other than converting the data.table to a matrix)?

library(data.table)
set.seed(1)             # reproducibility
dt <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
setkey(dt, group)

#    id group      value
# 1:  1     a -0.6264538
# 2:  2     a  0.1836433
# 3:  3     a -0.8356286
# 4:  4     a  1.5952808
# 5:  1     b  0.3295078
# 6:  2     b -0.8204684
# 7:  3     b  0.4874291
# 8:  4     b  0.7383247

Something that works, but requires the group names as input:

cor(dt["a"]$value, dt["b"]$value)
# [1] 0.1556371

I'm looking more for something like:

dt[, cor(value, value), by="group"]

But that does not give me the correlation(s) I'm after.

Here's the same problem for a matrix with the correct results.

set.seed(1)             # reproducibility
m <- matrix(rnorm(8), ncol=2)
dimnames(m) <- list(id=1:4, group=letters[1:2])

#        group
# id           a          b
#   1 -0.6264538  0.3295078
#   2  0.1836433 -0.8204684
#   3 -0.8356286  0.4874291
#   4  1.5952808  0.7383247

cor(m)                  # correlations between groups

#           a         b
# a 1.0000000 0.1556371
# b 0.1556371 1.0000000

Any comments or help greatly appreciated.

like image 398
Bram Visser Avatar asked Mar 15 '14 08:03

Bram Visser


People also ask

How do I correlate multiple columns in R?

In this method, the user has to call the cor() function and then within this function the user has to pass the name of the multiple variables in the form of vector as its parameter to get the correlation among multiple variables by specifying multiple column names in the R programming language.

What is a correlation table in R?

A correlation matrix is a table of correlation coefficients for a set of variables used to determine if a relationship exists between the variables. The coefficient indicates both the strength of the relationship as well as the direction (positive vs. negative correlations).

How do you find the correlation of a Dataframe in R?

The function rcorr() [in Hmisc package] can be used to compute the significance levels for pearson and spearman correlations. It returns both the correlation coefficients and the p-value of the correlation for all possible pairs of columns in the data table.

What is correlation between groups?

“Correlation” is a statistical tool used to assess the degree of association of two quantitative variables measured in each member of a group. Although it is a very commonly used tool in medical literature, it is also often misunderstood.


2 Answers

I've since found an even simple alternative for doing this. You were actually pretty close with your dt[, cor(value, value), by="group"] approach. What you actually need is to first do a Cartesian join on the dates, and then group by. I.e.

dt[dt, allow.cartesian=T][, cor(value, value), by=list(group, group.1)]

This has the advantage that it will join the series together (rather than assume they are the same length). You can then cast this into matrix form, or leave it as it is to plot as a heatmap in ggplot etc.

Full Example

setkey(dt, id)
c <- dt[dt, allow.cartesian=T][, list(Cor = cor(value, value.1)), by = list(group, group.1)]
c

   group group.1       Cor
1:     a       a 1.0000000
2:     b       a 0.1556371
3:     a       b 0.1556371
4:     b       b 1.0000000

dcast(c, group~group.1, value.var = "Cor")

  group         a         b
1     a 1.0000000 0.1556371
2     b 0.1556371 1.0000000
like image 119
Corvus Avatar answered Sep 21 '22 07:09

Corvus


There is no simple way to do this with data.table. The first way you've provided:

cor(dt["a"]$value, dt["b"]$value)

Is probably the simplest.

An alternative is to reshape your data.table from "long" format, to "wide" format:

> dtw <- reshape(dt, timevar="group", idvar="id", direction="wide")
> dtw
   id    value.a    value.b
1:  1 -0.6264538  0.3295078
2:  2  0.1836433 -0.8204684
3:  3 -0.8356286  0.4874291
4:  4  1.5952808  0.7383247
> cor(dtw[,list(value.a, value.b)])
          value.a   value.b
value.a 1.0000000 0.1556371
value.b 0.1556371 1.0000000

Update: If you're using data.table version >= 1.9.0, then you can use dcast.data.table instead which'll be much faster. Check this post for more info.

dcast.data.table(dt, id ~ group)
like image 41
Scott Ritchie Avatar answered Sep 20 '22 07:09

Scott Ritchie