
R reshaping melted data.table with list column

I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table into wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2, because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.

The answer from a previous question about working with melted data tables does not apply here because of the list column.

I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.

library(data.table)

x <- LETTERS[1:3]
dt <- data.table(
  x=rep(x, each=2),
  y='d',
  data=list(list(), list(), list(), list(), list(), list()),
  variable=rep(c('var.1', 'var.2'), 3),
  value=seq(1,6)
  )

# Column template set up
list_template <- Reduce(
  function(l, col) { l[[col]] <- col; l }, 
  unique(dt$variable),
  list())

# Expression set up
q <- substitute({
  l <- lapply(
    list_template, 
    function(col) .SD[variable==as.character(col)]$value)
  l$data = .SD[1,]$data
  l
}, list(list_template=list_template))

# Roll up
dt[, eval(q), by=list(x, y)]

   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>
Sim asked Nov 04 '22 05:11



1 Answer

This old question piqued my curiosity, as data.table has been improved significantly since 2013.

However, even with data.table version 1.11.4

dcast(dt, x + y + data ~ variable)

still returns an error

Columns specified in formula can not be of type list

The workaround follows the general outline of jonsedar's answer:

  1. Reshape the non-list columns from long to wide format
  2. Aggregate the list column data grouped by x and y
  3. Join the two partial results on x and y

but uses features of the current data.table syntax, e.g., the on parameter:

dcast(dt, x + y ~ variable)[
  dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)] 
   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>

The list column data is aggregated by taking the first element. This is in line with OP's code line

l$data = .SD[1,]$data

which also picks the first element.
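For readers who find the chained one-liner hard to follow, the same three steps can be written out separately. This is only a sketch using the sample data from the question; the intermediate names wide, lists and result are made up for illustration, and data[1L] is used as an explicit stand-in for first(data) to keep exactly one list entry per group.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  x = rep(LETTERS[1:3], each = 2),
  y = "d",
  data = list(list(), list(), list(), list(), list(), list()),
  variable = rep(c("var.1", "var.2"), 3),
  value = seq(1, 6)
)

# Step 1: reshape only the atomic columns from long to wide
wide <- dcast(dt, x + y ~ variable, value.var = "value")

# Step 2: keep the first list entry of each (x, y) group;
# data[1L] is a length-one list, so it yields one row per group
lists <- dt[, .(data = data[1L]), by = .(x, y)]

# Step 3: join the two partial results on the grouping columns
result <- wide[lists, on = .(x, y)]
result
```

Splitting the steps makes it easier to swap in a different aggregation for the list column in step 2 if taking the first entry is not appropriate for the real data.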

Uwe answered Nov 09 '22 14:11
