
R Multiple condition join using data.table

Tags: r, data.table

I have a large dataset and a lookup table. For each row in the dataset, I need to return the smallest Val among the lookup rows whose conditions that row satisfies.

Given the size of my dataset, I'm reluctant to hack together a solution based on cross-joining, as this would create many millions of records. I'm hoping someone can suggest a solution that (ideally) uses base R or data.table, since I already use both efficiently.

Example

library(data.table)

A<-seq(1e4,9e4,1e4)
B<-seq(0,1e4,1e3)

# Every combination of A and B, with a row ID
dt1<-data.table(expand.grid(A,B),ID=1:nrow(expand.grid(A,B)))
setnames(dt1, c("Var1","Var2"),c("A","B"))

lookup<-data.table(minA=c(1e4,1e4,2e4,2e4,5e4),
                 maxA=c(2e4,3e4,7e4,6e4,9e4),
                 minB=rep(2e3,5),
                 Val=seq(.1,.5,.1))

# Sample desired value
        A     B ID Val
99: 90000 10000 99 0.5

In SQL, I would write something along the lines of:

SELECT ID, A, B, min(Val) as Val
FROM dt1
LEFT JOIN lookup on dt1.A>=lookup.minA
                 and dt1.A<=lookup.maxA
                 and dt1.B>=lookup.minB
GROUP BY ID, A, B

This would join all matching records from lookup to dt1 and return the smallest Val for each dt1 row.

Update

My solution so far looks like:

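library(plyr)  # rbind.fill comes from plyr

# Cross join: add a constant key k to both tables, join on it, then drop it
CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]

# Keep only the pairs that satisfy the conditions, then take the minimum Val per dt1 row
dt1.lookup<- CJ.table(dt1,lookup)[A>=minA & A<=maxA & B>=minB,
                                  list(Val=Val[which.min( Val)]),
                                  by=list(ID,A,B)]
# Append the dt1 rows that matched nothing (their Val is filled with NA)
dt1.lookup<-rbind.fill(dt1.lookup, dt1[!ID %in% dt1.lookup$ID])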

This retrieves all records and lets me return additional columns from the lookup table if I need them. It also guarantees that the minimum Val is picked.

Steph Locke asked Nov 02 '22 05:11


1 Answer

I found a solution that avoids the cross join. First, prepare the data by dropping the rows where A or B falls outside the range of the lookup entirely:

Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]

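Then, for each row, build a data.table holding the first lookup index at which each condition is met. Because Val increases down the example lookup table, the first index corresponds to the lowest possible Val: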

Indices = Prep[,list(min(which(A >= lookup$minA)), 
                     min(which(A <= lookup$maxA)), 
                     min(which(B >= lookup$minB)), A, B),by=ID]

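Then take Val at the largest of those three indices, which is treated as the first lookup row where all three conditions hold. Each index is found independently, so this relies on how the example lookup is ordered and is worth checking against your own lookup table: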

Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]

See if this gets you what you're looking for:

    ID Val     A     B
 1: 19 0.1 10000  2000
 2: 20 0.1 20000  2000
 3: 21 0.2 30000  2000
 4: 22 0.3 40000  2000
 5: 23 0.3 50000  2000
 6: 24 0.3 60000  2000
 7: 25 0.3 70000  2000
 8: 26 0.5 80000  2000
 9: 27 0.5 90000  2000
10: 28 0.1 10000  3000
Señor O answered Nov 09 '22 15:11