I have a large dataset and a lookup table. I need to return for each row in the dataset the smallest value present for rows in the lookup where conditions are met.
Given the size of my dataset I'm reluctant to hack an iffy solution together by cross-joining as this would create many millions of records. I'm hoping someone can suggest a solution that (ideally) leverages base r or data.table since these are already in use in an efficient manner.
Example
A<-seq(1e4,9e4,1e4)
B<-seq(0,1e4,1e3)
dt1<-data.table(expand.grid(A,B),ID=1:nrow(expand.grid(A,B)))
setnames(dt1, c("Var1","Var2"),c("A","B"))
lookup<-data.table(minA=c(1e4,1e4,2e4,2e4,5e4),
maxA=c(2e4,3e4,7e4,6e4,9e4),
minB=rep(2e3,5),
Val=seq(.1,.5,.1))
# Sample Desired Value
A B ID Val
99: 90000 10000 99 0.5
In SQL, I would then write something along the lines of
SELECT ID, A, B, min(Val) as Val
FROM dt1
LEFT JOIN lookup on dt1.A>=lookup.minA
and dt1.A<=lookup.maxA
and dt1.B>=lookup.minB
GROUP BY ID, A, B
Which would join all matching records from lookup
to dt1
and return the smallest Val
.
Update
My solution so far looks like:
CJ.table<-function(X,Y) setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
dt1.lookup<- CJ.table(dt1,lookup)[A>=minA & A<=maxA & B>=minB,
list(Val=Val[which.min( Val)]),
by=list(ID,A,B)]
dt1.lookup<-rbind.fill(dt1.lookup, dt1[!ID %in% dt1.lookup$ID])
This retrieves all records and allows the return of additional columns from the lookup table if I need them. It also has the benefit of enforcing the pick of the minimum Val.
A solution I found without cross joining first needs to prepare the data by getting rid of rows where A
and B
are out of range entirely:
Prep = dt1[A >= min(lookup$minA) & A <= max(lookup$maxA) & B >= min(lookup$minB)]
Then you make a data table of where each of the conditions are met that correspond to the lowest possible Val
:
Indices = Prep[,list(min(which(A >= lookup$minA)),
min(which(A <= lookup$maxA)),
min(which(B >= lookup$minB)), A, B),by=ID]
Then you must get Val
at the lowest point where all three conditions are satisfied:
Indices[,list(Val=lookup$Val[max(V1,V2,V3)], A, B),by=ID]
See if this gets you what you're looking for:
ID Val A B
1: 19 0.1 10000 2000
2: 20 0.1 20000 2000
3: 21 0.2 30000 2000
4: 22 0.3 40000 2000
5: 23 0.3 50000 2000
6: 24 0.3 60000 2000
7: 25 0.3 70000 2000
8: 26 0.5 80000 2000
9: 27 0.5 90000 2000
10: 28 0.1 10000 3000
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With