Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Update a column of NAs in one data table with the value from a column in another data table

I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.

DT1

         tract CreditScore    
1: 36107020401         635  
2: 36083052403          NA  
3: 36091062602          NA  
4: 36067013000          NA  
5: 36083052304          NA  

DT2

         tract CreditScore  
1: 36107020401         635  
2: 36083052403         650  
3: 36091062602         335  
4: 36067013000         777  
5: 36083052304         663  
like image 849
user3067851 Avatar asked Mar 05 '15 22:03

user3067851


1 Answers

We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.


The idea is to first setkey() on DT1 on the column tract.

setkey(DT1, tract)

In data.tables, a join of the form x[i] requires key for x, but not necessarily for i. This results in two scenarios:

  • If i also has key set -- the first key column of i is matched against first key column of x, second against second and so on..

  • If i doesn't have key set -- the first column of i is matched against the first key column of x, second column of i against second key column of x and so on..

In this case, since your first column in i is also tract, we'll skip setting key on i.

Then, we perform a join of the form x[i]. By doing this, for each i the matching row indices in x is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows..

In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:

DT1[DT2, CreditScore := i.CreditScore]
#          tract CreditScore
# 1: 36067013000         777
# 2: 36083052304         663
# 3: 36083052403         650
# 4: 36091062602         335
# 5: 36107020401         635

DT1[DT2 part finds the matching rows in DT1 for each row in DT2. And if there's a match, we want DT2's value to be updated in DT1. We accomplish that by using i.CreditScore -- it refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between x and i data.tables).


Update: As pointed out under comments, the solution above would also update the non-NA values in DT1. Therefore the way to do it would be:

DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]

On those rows where CreditScore from DT1 is NA, replace CreditScore from DT1 with the values from CreditScore obtained from the join of DT2[.(.SD)], where .SD corresponds to the subset of data.table that contains all the rows where CreditScore is NA.

HTH

like image 89
Arun Avatar answered Oct 14 '22 15:10

Arun