Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I self join a data.table in a manner like dcast

Tags:

r

data.table

Suppose I have a data.table in "melted" form where I have a key, and identifier and a value

library(data.table)
library(reshape2)
DT = data.table(X = c(1:5, 1:4), Y = c(rep("A", 5), rep("B", 4)), Z = rnorm(9))
DT2 = data.table(dcast(DT, X~Y))

How can I perform that sort of self join inside data.table?

> DT
   X Y           Z
1: 1 A -0.19790449
2: 2 A  0.17906116
3: 3 A  0.01821837
4: 4 A  0.17309716
5: 5 A  0.05962474
6: 1 B -0.24629468
7: 2 B  0.92285734
8: 3 B  0.66002573
9: 4 B -1.01403880
> DT2
   X           A          B
1: 1 -0.19790449 -0.2462947
2: 2  0.17906116  0.9228573
3: 3  0.01821837  0.6600257
4: 4  0.17309716 -1.0140388
5: 5  0.05962474         NA

Aside (mostly for Arun): Here is a solution I already use for melt (was written with help from Matthew D, so he should have this code), that I think replicates melt completely, and is pretty efficient. Dcast on the other hand (or should that be dtcast?) is much harder!

melt.data.table = function(data, id.vars, measure.vars,
                           variable.name = "variable",
                           ..., na.rm = FALSE, value.name = "value") {
  if(missing(id.vars)){
    id.vars = setdiff(names(data), measure.vars)
  }
  if(missing(measure.vars)){
    measure.vars = setdiff(names(data), id.vars)
  }

  dtlist = lapply(measure.vars, function(..colname) {
    data[, c(id.vars, ..colname), with = FALSE][, (variable.name) := ..colname]
  })

  dt = rbindlist(dtlist)
  setnames(dt, measure.vars[1], value.name)
  if(na.rm){
    return(na.omit(dt))
  } else {
    return(dt)
  }
}
like image 753
Corvus Avatar asked Sep 10 '13 22:09

Corvus


1 Answers

Update: faster versions of melt and dcast are now implemented (in C) in data.table versions >= 1.9.0. Check this post for more info.

Now you can just do:

dcast.data.table(DT, X~Y)

In case of dcast alone, at the moment, it has to be written out completely (as it's not a S3 generic yet in reshape2). We'll try to fix this as soon as possible. For melt, you can just use melt(.) as you'd do normally.


The general idea is this:

setkey(DT, X, Y)
DT[CJ(1:5, c("A", "B"))][, as.list(Z), by=X]

You can name the columns V1 and V2 as A and B using setnames.

But this may not be efficient on large data or when the cast formula is complex. Or rather I should say, it could be much more efficient. We're in the process of finding such an implementation to integrate melt and cast on to data.table. Until then, you could get around this as above.

I'll update this post once we've made significant progress with melt/cast.

like image 156
Arun Avatar answered Nov 16 '22 03:11

Arun