Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Staggered join / new grouping / join in R with data.table's X[Y] syntax

Tags:

r

data.table

I have two data.tables: samples, resources

resources is connected with samples via primary and secondary ids. I want to combine the information from the resources with the sample-table first via the primary id, and only if this produces NA, then I want to resort to the secondary resources from the same table (within one data.table command chain).

# resources:
   primary secondary info
1:      17        42  "I"
2:      18        NA  "J"
3:      19        43  "K"

# samples:
   name primary secondary
1:  "a"      17        55
2:  "b"       0        42
3   "c"      18        42

The desired result would be:

# joined tables:
   name info  # primary secondary
1:  "a"  "I"
2:  "b"  "I"
3:  "c"  "J"

The first join via primary is easy, it produces

# Update:
samples <- data.table(name = letters[1:3], 
                      primary = c(17, 0, 18), 
                      secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19, 
                        secondary = c(42, NA, 43), 
                        info = LETTERS[9:11])
# first join:
setkey(samples, primary)
setkey(resources, primary)
samples[resources]

   name info  # primary secondary
1:  "a"  "I"
2:  "b"   NA
3:  "c"  "J"

But then? I need to re-key samples with setkey(samples, secondary), right? And then subset to only those rows that produces NAs. But all this is not really possible within one command chain (and imagine there were more than two criteria...). How can I achieve this more succinctly?

... updated with code for the data.tables.

like image 567
Wordsmyth Avatar asked Apr 23 '15 13:04

Wordsmyth


3 Answers

While you could do it on a single line, I think that obscures the meaning of what you do, makes things incredibly hard to read/understand/debug/remember what the hell you did in a month, and is simply a bad idea.

Smaller, much more easily digestible chunks are the way to go imo:

setkey(samples, primary)
setkey(resources, primary)
samples[resources, info := i.info]

setkey(samples, secondary)
setkey(resources, secondary)
samples[resources, info := ifelse(is.na(info), i.info, info)]

samples
#   name primary secondary info
#1:    b       0        42    I
#2:    c      18        42    J
#3:    a      17        55    I

# keep going with tertiary and so on if you like

As @nachti pointed out in the comments, you might need to add allow.cartesian=TRUE for versions before 1.9.5 depending on your data.

like image 107
eddi Avatar answered Oct 23 '22 04:10

eddi


This would be one chain with 2 calls to resources, one of them re-setkey behind the scene.

library(data.table)
samples <- data.table(name = letters[1:3], 
                      primary = c(17, 0, 18), 
                      secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19, 
                        secondary = c(42, NA, 43), 
                        info = LETTERS[9:11])
setkey(samples, primary)
setkey(resources, primary)
samples[resources, info := i.info
        ][, .(name, info),, secondary
          ][resources[, info,, secondary], info := ifelse(is.na(info), i.info, info)
            ][, secondary := NULL]

As you are asking about more complicated examples. It's worth to note the data.table queries can be easily managed as modules by preparing sub-query arguments in advance. They can be later easily conditionally managed. See below example.

lkp2 <- quote(resources[, info,, secondary])
lkp2_formula <- quote(info := ifelse(is.na(info), i.info, info))
setkey(samples, primary)
samples[resources, info := i.info
        ][, .(name, info),, secondary
          ][eval(lkp2), eval(lkp2_formula)
            ][, secondary := NULL]

If you heavily rely on data.table chaining processes you may find dtq package useful.

like image 41
jangorecki Avatar answered Oct 23 '22 03:10

jangorecki


I think it's too tricky to do it within one command chain, but I've a solution for you:

### First step
samples[resources[samples, nomatch = 0], info := info]
samples

   name primary secondary info
1:    b       0        42   NA
2:    a      17        55    I
3:    c      18        42    J

### Second step
setkey(samples, secondary)
setkey(resources, secondary)
## create new column info1
samples[resources[samples[is.na(info)],
                  list(info1 = unique(info)), by = .EACHI],
        info1 := info1]
## merge it to samples, where info is NA
samples[is.na(info), info := info1]
## remove info1 (and maybe other unused columns)
samples[, info1 := NULL]
## sort samples by name
setkey(samples, name)
samples

   name primary secondary info
1:    a      17        55    I
2:    b       0        42    I
3:    c      18        42    J

HTH
~g

like image 1
nachti Avatar answered Oct 23 '22 03:10

nachti