Correlation between groups in R data.table

Tags: r, data.table, correlation

Is there a way of elegantly calculating the correlations between values if those values are stored by group in a single column of a data.table (other than converting the data.table to a matrix)?

library(data.table)
set.seed(1)             # reproducibility
dt <- data.table(id=1:4, group=rep(letters[1:2], c(4,4)), value=rnorm(8))
setkey(dt, group)

#    id group      value
# 1:  1     a -0.6264538
# 2:  2     a  0.1836433
# 3:  3     a -0.8356286
# 4:  4     a  1.5952808
# 5:  1     b  0.3295078
# 6:  2     b -0.8204684
# 7:  3     b  0.4874291
# 8:  4     b  0.7383247

Something that works, but requires the group names as input:

cor(dt["a"]$value, dt["b"]$value)
# [1] 0.1556371

I'm looking more for something like:

dt[, cor(value, value), by="group"]

But that does not give me the correlation(s) I'm after.

Here's the same problem for a matrix with the correct results.

set.seed(1)             # reproducibility
m <- matrix(rnorm(8), ncol=2)
dimnames(m) <- list(id=1:4, group=letters[1:2])

#        group
# id           a          b
#   1 -0.6264538  0.3295078
#   2  0.1836433 -0.8204684
#   3 -0.8356286  0.4874291
#   4  1.5952808  0.7383247

cor(m)                  # correlations between groups

#           a         b
# a 1.0000000 0.1556371
# b 0.1556371 1.0000000

Any comments or help greatly appreciated.

asked Mar 15 '14 08:03

Bram Visser

2 Answers

I've since found an even simpler alternative for doing this. You were actually pretty close with your dt[, cor(value, value), by="group"] approach. What you need is to first do a Cartesian self-join on id (the key the series share, e.g. dates for time series), and then group by both group columns, i.e.

dt[dt, allow.cartesian=TRUE][, cor(value, value.1), by=list(group, group.1)]   # assumes setkey(dt, id)

This has the advantage that it will join the series together (rather than assume they are the same length). You can then cast this into matrix form, or leave it as it is to plot as a heatmap in ggplot etc.

Full Example

setkey(dt, id)          # join on id, not on group
cors <- dt[dt, allow.cartesian=TRUE][, list(Cor = cor(value, value.1)), by = list(group, group.1)]
cors

   group group.1       Cor
1:     a       a 1.0000000
2:     b       a 0.1556371
3:     a       b 0.1556371
4:     b       b 1.0000000

dcast(cors, group~group.1, value.var = "Cor")

  group         a         b
1     a 1.0000000 0.1556371
2     b 0.1556371 1.0000000
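
The `.1` suffixes above depend on how the installed data.table names clashing columns in `dt[dt]`; newer releases, as far as I can tell, use an `i.` prefix instead. If that is a concern, here is a rough sketch of the same idea using `merge()` with explicit `suffixes`, so the column names are fixed by the call itself. The names `joined`, `cors` and `cor_wide` are only illustrative.

```r
library(data.table)

set.seed(1)             # same data as in the question
dt <- data.table(id = 1:4, group = rep(letters[1:2], c(4, 4)), value = rnorm(8))

# Self-join on id; every row is paired with every row that shares its id
joined <- merge(dt, dt, by = "id", suffixes = c(".x", ".y"), allow.cartesian = TRUE)

# One correlation per pair of groups
cors <- joined[, .(Cor = cor(value.x, value.y)), by = .(group.x, group.y)]

# Optional: cast the long result into matrix-like form
cor_wide <- dcast(cors, group.x ~ group.y, value.var = "Cor")
cor_wide
```

With the example data, the a/b entry should agree with the 0.1556371 shown above.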

answered Sep 21 '22 07:09

Corvus

There is no simple way to do this with data.table. The first way you've provided:

cor(dt["a"]$value, dt["b"]$value)

is probably the simplest.

An alternative is to reshape your data.table from "long" format to "wide" format:

> dtw <- reshape(dt, timevar="group", idvar="id", direction="wide")
> dtw
   id    value.a    value.b
1:  1 -0.6264538  0.3295078
2:  2  0.1836433 -0.8204684
3:  3 -0.8356286  0.4874291
4:  4  1.5952808  0.7383247
> cor(dtw[,list(value.a, value.b)])
          value.a   value.b
value.a 1.0000000 0.1556371
value.b 0.1556371 1.0000000

Update: If you're using data.table version >= 1.9.0, then you can use dcast.data.table instead, which will be much faster. Check this post for more info.

dcast.data.table(dt, id ~ group, value.var = "value")
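
As a rough sketch of how the wide route could be finished off without typing the group names: cast to one column per group, drop `id`, and pass what is left to `cor()`. This assumes a data.table recent enough that plain `dcast()` works on a data.table directly, and that each `id` appears at most once per group. The names `wide` and `cor_mat` are only illustrative.

```r
library(data.table)

set.seed(1)             # same data as in the question
dt <- data.table(id = 1:4, group = rep(letters[1:2], c(4, 4)), value = rnorm(8))

# Long to wide: one row per id, one column per group
wide <- dcast(dt, id ~ group, value.var = "value")

# Correlate every group column with every other, whatever the groups are called
group_cols <- setdiff(names(wide), "id")
cor_mat <- cor(wide[, group_cols, with = FALSE])
cor_mat
```

The result is an ordinary matrix, so the same call should scale to more than two groups.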

answered Sep 20 '22 07:09

Scott Ritchie
