Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Add missing rows to data.table according to multiple keyed columns

I have a data.table object that contains multiple columns that specify unique cases. In the small example below, the variables "name", "job", and "sex" specify the unique IDs. I would like to add missing rows so that each each case has a row for each possible instance of another variable, "from" (similar to expand.grid).

library(data.table)
set.seed(1)
mydata <- data.table(name = c("john","john","john","john","mary","chris","chris","chris"),
                 job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
                 sex = c("male","male","male","male","female","female","male","male"),
                 from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
                 score = rnorm(8))

setkeyv(mydata, cols=c("name","job","sex"))

mydata[CJ(unique(name, job, sex), unique(from))]

Here's the current data.table object:

> mydata
    name     job    sex from      score
1:  john teacher   male  NYT -0.6264538
2:  john teacher   male USAT  0.1836433
3:  john teacher   male   BG -0.8356286
4:  john teacher   male TIME  1.5952808
5:  mary  police female USAT  0.3295078
6: chris  lawyer female   BG -0.8204684
7: chris  lawyer   male  NYT  0.4874291
8: chris  doctor   male  NYT  0.7383247

Here's the result I'd like:

> mydata
     name     job    sex from      score
1:   john teacher   male  NYT -0.6264538
2:   john teacher   male USAT  0.1836433
3:   john teacher   male   BG -0.8356286
4:   john teacher   male TIME  1.5952808
5:   mary  police female  NYT  NA
6:   mary  police female USAT  0.3295078
7:   mary  police female   BG  NA
8:   mary  police female TIME  NA
9:  chris  lawyer female  NYT -NA
10: chris  lawyer female USAT -NA
11: chris  lawyer female   BG -0.8204684
12: chris  lawyer female TIME -NA
13: chris  lawyer   male  NYT  0.4874291
14: chris  lawyer   male USAT  NA
15: chris  lawyer   male   BG  NA
16: chris  lawyer   male TIME  NA
17: chris  doctor   male  NYT  0.7383247
18: chris  doctor   male USAT  NA
19: chris  doctor   male   BG  NA
20: chris  doctor   male TIME  NA

Here's what I've tried:

setkeyv(mydata, cols=c("name","job","sex"))
mydata[CJ(unique(name, job, sex), unique(from))]

But I receive the following error and adding fromLast=TRUE (or FALSE) does not give me the right solution:

Error in unique.default(name, job, sex) : 
  'fromLast' must be TRUE or FALSE

Here are the relevant answers I've come across (but none appears to deal with multiple keyed columns): add missing rows to a data table

Efficiently inserting default missing rows in a data.table

Fastest way to add rows for missing values in a data.frame?

like image 954
itpetersen Avatar asked Dec 09 '14 05:12

itpetersen


1 Answers

A couple of possibilities are here - https://github.com/Rdatatable/data.table/pull/814

CJ.dt = function(...) {
  rows = do.call(CJ, lapply(list(...), function(x) if(is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)));
  do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}

setkey(mydata, name, job, sex, from)

mydata[CJ.dt(unique(data.table(name, job, sex)), unique(from))]
#     name     job    sex from      score
# 1: chris  doctor   male  NYT  0.7383247
# 2: chris  doctor   male   BG         NA
# 3: chris  doctor   male TIME         NA
# 4: chris  doctor   male USAT         NA
# 5: chris  lawyer female  NYT         NA
# 6: chris  lawyer female   BG -0.8204684
# 7: chris  lawyer female TIME         NA
# 8: chris  lawyer female USAT         NA
# 9: chris  lawyer   male  NYT  0.4874291
#10: chris  lawyer   male   BG         NA
#11: chris  lawyer   male TIME         NA
#12: chris  lawyer   male USAT         NA
#13:  john teacher   male  NYT -0.6264538
#14:  john teacher   male   BG -0.8356286
#15:  john teacher   male TIME  1.5952808
#16:  john teacher   male USAT  0.1836433
#17:  mary  police female  NYT         NA
#18:  mary  police female   BG         NA
#19:  mary  police female TIME         NA
#20:  mary  police female USAT  0.3295078
like image 111
eddi Avatar answered Sep 23 '22 13:09

eddi