I have two data.tables
: samples, resources
resources
is connected with samples
via primary
and secondary
ids.
I want to combine the information from the resources with the sample-table first via the primary id, and only if this produces NA, then I want to resort to the secondary resources from the same table (within one data.table command chain).
# resources:
primary secondary info
1: 17 42 "I"
2: 18 NA "J"
3: 19 43 "K"
# samples:
name primary secondary
1: "a" 17 55
2: "b" 0 42
3 "c" 18 42
The desired result would be:
# joined tables:
name info # primary secondary
1: "a" "I"
2: "b" "I"
3: "c" "J"
The first join via primary
is easy, it produces
# Update:
samples <- data.table(name = letters[1:3],
primary = c(17, 0, 18),
secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19,
secondary = c(42, NA, 43),
info = LETTERS[9:11])
# first join:
setkey(samples, primary)
setkey(resources, primary)
samples[resources]
name info # primary secondary
1: "a" "I"
2: "b" NA
3: "c" "J"
But then? I need to re-key samples with setkey(samples, secondary)
, right? And then subset to only those rows that produces NAs. But all this is not really possible within one command chain (and imagine there were more than two criteria...). How can I achieve this more succinctly?
... updated with code for the data.tables.
While you could do it on a single line, I think that obscures the meaning of what you do, makes things incredibly hard to read/understand/debug/remember what the hell you did in a month, and is simply a bad idea.
Smaller, much more easily digestible chunks are the way to go imo:
setkey(samples, primary)
setkey(resources, primary)
samples[resources, info := i.info]
setkey(samples, secondary)
setkey(resources, secondary)
samples[resources, info := ifelse(is.na(info), i.info, info)]
samples
# name primary secondary info
#1: b 0 42 I
#2: c 18 42 J
#3: a 17 55 I
# keep going with tertiary and so on if you like
As @nachti pointed out in the comments, you might need to add allow.cartesian=TRUE
for versions before 1.9.5 depending on your data.
This would be one chain with 2 calls to resources
, one of them re-setkey behind the scene.
library(data.table)
samples <- data.table(name = letters[1:3],
primary = c(17, 0, 18),
secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19,
secondary = c(42, NA, 43),
info = LETTERS[9:11])
setkey(samples, primary)
setkey(resources, primary)
samples[resources, info := i.info
][, .(name, info),, secondary
][resources[, info,, secondary], info := ifelse(is.na(info), i.info, info)
][, secondary := NULL]
As you are asking about more complicated examples. It's worth to note the data.table queries can be easily managed as modules by preparing sub-query arguments in advance. They can be later easily conditionally managed. See below example.
lkp2 <- quote(resources[, info,, secondary])
lkp2_formula <- quote(info := ifelse(is.na(info), i.info, info))
setkey(samples, primary)
samples[resources, info := i.info
][, .(name, info),, secondary
][eval(lkp2), eval(lkp2_formula)
][, secondary := NULL]
If you heavily rely on data.table chaining processes you may find dtq package useful.
I think it's too tricky to do it within one command chain, but I've a solution for you:
### First step
samples[resources[samples, nomatch = 0], info := info]
samples
name primary secondary info
1: b 0 42 NA
2: a 17 55 I
3: c 18 42 J
### Second step
setkey(samples, secondary)
setkey(resources, secondary)
## create new column info1
samples[resources[samples[is.na(info)],
list(info1 = unique(info)), by = .EACHI],
info1 := info1]
## merge it to samples, where info is NA
samples[is.na(info), info := info1]
## remove info1 (and maybe other unused columns)
samples[, info1 := NULL]
## sort samples by name
setkey(samples, name)
samples
name primary secondary info
1: a 17 55 I
2: b 0 42 I
3: c 18 42 J
HTH
~g
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With