
data.table equivalent of tidyr::complete()

tidyr::complete() adds rows to a data.frame for combinations of column values that are missing from the data. Example:

library(dplyr)
library(tidyr)

df <- data.frame(person = c(1,2,2),
                 observation_id = c(1,1,2),
                 value = c(1,1,1))
df %>%
  tidyr::complete(person,
                  observation_id,
                  fill = list(value=0))

yields

# A tibble: 4 × 3
  person observation_id value
   <dbl>          <dbl> <dbl>
1      1              1     1
2      1              2     0
3      2              1     1
4      2              2     1

where the combination person == 1 and observation_id == 2, which is missing from df, has been added with its value filled in as 0.

What would be the equivalent of this in data.table?

RoyalTS asked Apr 18 '17 22:04



1 Answer

I reckon that the philosophy of data.table entails fewer specially-named functions for tasks than you'll find in the tidyverse, so some extra coding is required, like:

res = setDT(df)[
  CJ(person = person, observation_id = observation_id, unique=TRUE),
  on=.(person, observation_id)
]

After this, you still have to manually handle the filling of values for missing levels. We can use setnafill to handle this efficiently & by-reference in recent versions of data.table:

setnafill(res, fill = 0, cols = 'value') 
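Putting the two steps together, a minimal end-to-end sketch might look like the following. It rebuilds the question's example as a data.table called DT (an illustrative name, chosen so that df is not modified by reference), and it assumes a data.table version that ships setnafill, which as far as I know only handles numeric columns.

```r
library(data.table)

# Same data as in the question; DT is an illustrative name
DT <- data.table(person = c(1, 2, 2),
                 observation_id = c(1, 1, 2),
                 value = c(1, 1, 1))

# Step 1: join DT onto every person/observation_id combination.
# Combinations absent from DT come back with value = NA.
res <- DT[CJ(person = person, observation_id = observation_id, unique = TRUE),
          on = .(person, observation_id)]

# Step 2: replace the NAs in 'value' with 0, by reference
setnafill(res, fill = 0, cols = "value")

res[]
```

For non-numeric columns the fill would presumably have to be done some other way, for example with the replace-based approach in the wrapper further down.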

See @Jealie's answer regarding a feature that will sidestep this.


Certainly, it's crazy that the column names have to be entered three times here. But on the other hand, one can write a wrapper:

completeDT <- function(DT, cols, defs = NULL){
  # all combinations of the unique values in cols
  mDT = do.call(CJ, c(DT[, ..cols], list(unique=TRUE)))
  # join DT onto the combinations; missing ones get NA
  res = DT[mDT, on=names(mDT)]
  # replace NA with the supplied default in each column named in defs
  if (length(defs))
    res[, names(defs) := Map(replace, .SD, lapply(.SD, is.na), defs), .SDcols=names(defs)]
  res[]
}

completeDT(setDT(df), cols = c("person", "observation_id"), defs = c(value = 0))

   person observation_id value
1:      1              1     1
2:      1              2     0
3:      2              1     1
4:      2              2     1

As a quick way of avoiding typing the names three times for the first step, here's @thelatemail's idea:

vars <- c("person","observation_id")
df[do.call(CJ, c(mget(vars), unique=TRUE)), on=vars]

# or with magrittr...
c("person","observation_id") %>% df[do.call(CJ, c(mget(.), unique=TRUE)), on=.]

Update: you no longer need to enter the names twice in CJ, thanks to an improvement by @MichaelChirico & @MattDowle.
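With that change, the first step can be written without repeating each name inside CJ. A minimal sketch, assuming a data.table version in which CJ names its output columns after its arguments (1.12.2 or later, if I recall correctly); DT is just an illustrative name for the question's data:

```r
library(data.table)

# Same data as in the question; DT is an illustrative name
DT <- data.table(person = c(1, 2, 2),
                 observation_id = c(1, 1, 2),
                 value = c(1, 1, 1))

# Assumption: CJ() auto-names its columns "person" and "observation_id",
# so they line up with the names given in on=
res <- DT[CJ(person, observation_id, unique = TRUE),
          on = .(person, observation_id)]

# The missing combination is present, with value still NA at this point
res[]
```

The names still appear once in CJ and once in on=, so the wrapper or the mget trick above remains handy when there are many columns.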

Frank answered Sep 29 '22 07:09
