I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.
DT1
tract CreditScore
1: 36107020401 635
2: 36083052403 NA
3: 36091062602 NA
4: 36067013000 NA
5: 36083052304 NA
DT2
tract CreditScore
1: 36107020401 635
2: 36083052403 650
3: 36091062602 335
4: 36067013000 777
5: 36083052304 663
We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.
The idea is to first setkey()
on DT1
on the column tract
.
setkey(DT1, tract)
In data.tables, a join of the form x[i]
requires key for x
, but not necessarily for i
. This results in two scenarios:
If i
also has key set -- the first key column of i
is matched against first key column of x
, second against second and so on..
If i
doesn't have key set -- the first column of i
is matched against the first key column of x
, second column of i
against second key column of x
and so on..
In this case, since your first column in i
is also tract
, we'll skip setting key on i
.
Then, we perform a join of the form x[i]
. By doing this, for each i
the matching row indices in x
is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1
's CreditScore
column with DT2
's on those matching rows..
In data.tables, we can perform that operation while joining, by providing the expression in j
, as follows:
DT1[DT2, CreditScore := i.CreditScore]
# tract CreditScore
# 1: 36067013000 777
# 2: 36083052304 663
# 3: 36083052403 650
# 4: 36091062602 335
# 5: 36107020401 635
DT1[DT2
part finds the matching rows in DT1
for each row in DT2
. And if there's a match, we want DT2
's value to be updated in DT1
. We accomplish that by using i.CreditScore
-- it refers to DT2
's CreditScore
column (i.
is a prefix used to distinguish columns with identical names between x
and i
data.tables).
Update: As pointed out under comments, the solution above would also update the non-NA values in DT1
. Therefore the way to do it would be:
DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]
On those rows where CreditScore
from DT1
is NA
, replace CreditScore
from DT1
with the values from CreditScore
obtained from the join of DT2[.(.SD)]
, where .SD
corresponds to the subset of data.table that contains all the rows where CreditScore
is NA
.
HTH
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With