Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Make the `drop` argument in `dcast` only look at the RHS of the formula

The drop argument in dcast (from "reshape2" or "dplyr") can be useful when going from a "long" to a "wide" dataset and you want to create columns even for combinations that do not exist in the long form.

It turns out that using drop also affects combinations the left hand side (LHS) of the formula as well as the right hand side (RHS). Thus, it also creates extra rows based on the combinations of LHS values.

Is there a way to override this behavior?


Here's some sample data:

library(data.table)
DT <- data.table(v1 = c(1.105, 1.105, 1.105, 2.012, 2.012, 2.012),
                 ID = c(1L, 1L, 1L, 2L, 2L, 2L), 
                 v2 = structure(c(2L, 3L, 5L, 1L, 2L, 6L), 
                                .Label = c("1", "2", "3", "4", "5", "6"), 
                                class = "factor"),
                 v3 = c(3L, 2L, 2L, 5L, 4L, 3L)) 

Notice that "v2" is a factor column with 6 levels. I essentially want to go from "long" to wide", but add in columns for any missing factor levels (in this case "4").

reshape handles the shape, but not the missing columns:

reshape(DT, direction = "wide", idvar = c("ID", "v1"), timevar = "v2")
#       v1 ID v3.2 v3.3 v3.5 v3.1 v3.6
# 1: 1.105  1    3    2    2   NA   NA
# 2: 2.012  2    4   NA   NA    5    3

dcast handles adding the missing columns, but only if there's one value on the LHS:

dcast(DT, ID ~ v2, value.var = "v3", drop = FALSE)
#    ID  1 2  3  4  5  6
# 1:  1 NA 3  2 NA  2 NA
# 2:  2  5 4 NA NA NA  3

If there are multiple values on the LHS, the combinations of the values on the LHS are also expanded out, as if we had used CJ or expand.grid, but rows 2 and 3 are not at all of interest to me:

dcast(DT, ... ~ v2, value.var = "v3", drop = FALSE)
#       v1 ID  1  2  3  4  5  6
# 1: 1.105  1 NA  3  2 NA  2 NA
# 2: 1.105  2 NA NA NA NA NA NA
# 3: 2.012  1 NA NA NA NA NA NA
# 4: 2.012  2  5  4 NA NA NA  3

This is similar to using xtabs in base R: ftable(xtabs(v3 ~ ID + v1 + v2, DT)).


Is there a way to let dcast know that essentially, "Hey. The combination of values on the LHS are the IDs. Don't try to fill them in for me."

My current approach is to do three steps, one for collapsing down the LHS values, another for spreading out the RHS values, and then one for merging the result.

merge(DT[, list(v1 = unique(v1)), .(ID)],  ## or unique(DT[, c("ID", "v1"), with = FALSE])
      dcast(DT, ID ~ v2, value.var = "v3", drop = FALSE), 
      by = "ID")[]
#    ID    v1  1 2  3  4  5  6
# 1:  1 1.105 NA 3  2 NA  2 NA
# 2:  2 2.012  5 4 NA NA NA  3

Is there a better approach that I'm missing?

like image 563
A5C1D2H2I1M1N2O1R2T1 Avatar asked Jan 16 '16 19:01

A5C1D2H2I1M1N2O1R2T1


People also ask

What does dcast do in R?

dcast: Convert data between wide and long forms.

What is value var in dcast?

dcast uses formula interface. The variables on the LHS of formula represents the id vars and RHS the measure vars. value. var denotes the column to be filled in with while casting to wide format.

How do I Dcast data in R?

In this example, we use the dcast() function to reshape our data. That is, we can use dcast() to order our data rows according to specific variables. Table 5 reveals the output of the previous R programming syntax – Before, the data was ordered as time, feature, ID. Now it is ordered as ID, time, feature.

What is reshape2 package in R?

reshape2 is an R package written by Hadley Wickham that makes it easy to transform data between wide and long formats.


1 Answers

Just implemented in data.table development version v1.9.7, commit 2113, closes #1512.

require(data.table) # v1.9.7, commit 2113+
dcast(DT, ... ~ v2, value.var = "v3", drop = c(TRUE, FALSE))
#       v1 ID  1 2  3  4  5  6
# 1: 1.105  1 NA 3  2 NA  2 NA
# 2: 2.012  2  5 4 NA NA NA  3
like image 64
Arun Avatar answered Sep 29 '22 13:09

Arun