Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reshape long structured data.table into a wide structure using data.table functionality?

Tags:

r

data.table

> library(data.table)
> A <- data.table(x = c(1,1,2,2), y = c(1,2,1,2), v = c(0.1,0.2,0.3,0.4))
> A
   x y   v
1: 1 1 0.1
2: 1 2 0.2
3: 2 1 0.3
4: 2 2 0.4
> B <- dcast(A, x~y)
Using v as value column: use value.var to override.
> B
  x   1   2
1 1 0.1 0.2
2 2 0.3 0.4

Apparently I can reshape a data.table from long to wide using f.x. dcast of package reshape2. But data.table comes along with an overloaded bracket-operator offering parameters like 'by' and 'group', which make me wonder if it is possible to achieve it using this (to data.table specific functionality)?

Just one random example from the manual:

DT[,lapply(.SD,sum),by=x]

That looks awesome - but I don't fully understand the usage yet.

I neither found a way nor an example for this so maybe it is just not possible maybe it isn't even supposed to be - so, a definite "no, is not possible because ..." is then of course also a valid answer.

like image 541
Raffael Avatar asked Aug 04 '13 21:08

Raffael


2 Answers

I'll pick an example with unequal groups so that it's easier to illustrate for the general case:

A <- data.table(x=c(1,1,1,2,2), y=c(1,2,3,1,2), v=(1:5)/5)
> A
   x y   v
1: 1 1 0.2
2: 1 2 0.4
3: 1 3 0.6
4: 2 1 0.8
5: 2 2 1.0

The first step is to get the number of elements/entries for each group of "x" to be the same. Here, for x=1 there are 3 values of y, but only 2 for x=2. So, we'll have to fix that first with NA for x=2, y=3.

setkey(A, x, y)
A[CJ(unique(x), unique(y))]

Now, to get it to wide format, we should group by "x" and use as.list on v as follows:

out <- A[CJ(unique(x), unique(y))][, as.list(v), by=x]
   x  V1  V2  V3
1: 1 0.2 0.4 0.6
2: 2 0.8 1.0  NA

Now, you can set the names of the reshaped columns using reference with setnames as follows:

setnames(out, c("x", as.character(unique(A$y)))

   x   1   2   3
1: 1 0.2 0.4 0.6
2: 2 0.8 1.0  NA
like image 114
Arun Avatar answered Oct 23 '22 12:10

Arun


Use dcast() (now a default data.table method, from version 1.9.5; earlier versions use dcast.data.table) as in

> dcast(A,x~y)
Using 'v' as value column. Use 'value.var' to override
   x   1   2   3
1: 1 0.2 0.4 0.6
2: 2 0.8 1.0  NA

This is fast and obviates the need to setnames().

It is also especially helpful when y in the above example is a factor variable with character levels -- e.g. 'Low', 'Medium', 'High' -- because CJ() may not return the wide data with variables in the order that setnames() expects, and you can end up with your data mislabeled badly.

like image 37
Gabi Avatar answered Oct 23 '22 11:10

Gabi