I have a dataset (mydata) that contains multiple columns which could fit inside ranges that are stored in another dataset (mycomparison).
I would like to join mycomparison to mydata where the mydata values are within the ranges in mycomparison.
library(data.table)
mydata<-data.table(
id=1:5,
val1=seq(10000, 50000, by=10000),
val2=floor(rnorm(5,mean=400,sd=100)),
val3=rnorm(5,mean=.7,sd=.1)
)
mycomparison<-data.table(
Name=LETTERS[1:3],
minval1=c(0,30000,10000),
maxval1=c(50000,80000,30000),
minval2=c(300,400,300),
maxval2=c(800,800,800),
minval3=c(0,.5,.2),
maxval3=c(1,.9,.8),
correspondingval=c(.1,.2,.3)
)
Desired output:
> mydata.withmatches
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.4844319 A 0 50000 300 800 0 1 0.1
2: 2 20000 425 0.7856313 NA NA NA NA NA NA NA NA
3: 3 30000 324 0.8063969 NA NA NA NA NA NA NA NA
4: 4 40000 263 0.5590113 NA NA NA NA NA NA NA NA
5: 5 50000 187 0.8764396 NA NA NA NA NA NA NA NA
My current approach feels (and is) very clunky: it involves cross-joining the data (using optiRum::CJ.dt), doing a big logical check, and then reassembling the data.
library(optiRum)
workingdt<-CJ.dt(mydata,mycomparison)
matched<-workingdt[val1>=minval1 &
val1<=maxval1 &
val2>=minval2 &
val2<=maxval2 &
val3>=minval3 &
val3<=maxval3][which.min(correspondingval)]
# %in% also behaves correctly when matched has zero rows or more than one row
notmatched<-mydata[!id %in% matched[,id]]
all<-list(matched,notmatched)
mydata.withmatches<- rbindlist(all, fill=TRUE, use.names=TRUE)
I'm aware of foverlaps, but it works on a single interval, not on many ranges as in this instance.
I'm hoping for a less clunky and more elegant solution.
I do not exactly understand your Desired Output, because multiple ids match rows of the mycomparison data.table. Using your data (rounded to two decimal places):
> mydata
id val1 val2 val3
1: 1 10000 387 0.48
2: 2 20000 425 0.79
3: 3 30000 324 0.81
4: 4 40000 263 0.56
5: 5 50000 187 0.88
And
> mycomparison
Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: A 0 50000 300 800 0.0 1.0 0.1
2: B 30000 80000 400 800 0.5 0.9 0.2
3: C 10000 30000 300 800 0.2 0.8 0.3
Cross-joining the two gives:
> workingdt
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 4 40000 263 0.56 A 0 50000 300 800 0.0 1.0 0.1
5: 5 50000 187 0.88 A 0 50000 300 800 0.0 1.0 0.1
6: 1 10000 387 0.48 B 30000 80000 400 800 0.5 0.9 0.2
7: 2 20000 425 0.79 B 30000 80000 400 800 0.5 0.9 0.2
8: 3 30000 324 0.81 B 30000 80000 400 800 0.5 0.9 0.2
9: 4 40000 263 0.56 B 30000 80000 400 800 0.5 0.9 0.2
10: 5 50000 187 0.88 B 30000 80000 400 800 0.5 0.9 0.2
11: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
12: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
13: 3 30000 324 0.81 C 10000 30000 300 800 0.2 0.8 0.3
14: 4 40000 263 0.56 C 10000 30000 300 800 0.2 0.8 0.3
15: 5 50000 187 0.88 C 10000 30000 300 800 0.2 0.8 0.3
Leaving off your which.min() returns every matching row:
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
5: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
If you use the data.table group-by functionality, you can pick the row with min(correspondingval) for each id (I am leaving off the unmatched data for the moment):
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3][
, .SD[which.min(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0 1 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0 1 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0 1 0.1
Or the row with max(correspondingval), if you prefer:
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3][
, .SD[which.max(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
2: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
If all you want, as shown in your Desired Output, is the first row with the minimum correspondingval and everything else with NA, there are easier ways to do this. If you want to know where each id matches a range, as I have shown in my output, then a cleaner, more elegant solution is different.
Let me know.
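For the case where you want to know which ranges each id falls into, one possibility is a data.table non-equi join, which avoids building the full cross join. The following is only a minimal sketch, not a tested solution for your real data: it assumes a reasonably recent data.table (non-equi joins in on= need 1.9.8 or later, and nomatch = NULL needs 1.12.0 or later), it uses the rounded values shown above rather than your random draws, and the names matches, best and result are my own placeholders.

```r
library(data.table)

# Rounded copy of the example data shown above (assumption: stands in for the rnorm draws)
mydata <- data.table(
  id   = 1:5,
  val1 = seq(10000, 50000, by = 10000),
  val2 = c(387, 425, 324, 263, 187),
  val3 = c(0.48, 0.79, 0.81, 0.56, 0.88)
)
mycomparison <- data.table(
  Name = LETTERS[1:3],
  minval1 = c(0, 30000, 10000),
  maxval1 = c(50000, 80000, 30000),
  minval2 = c(300, 400, 300),
  maxval2 = c(800, 800, 800),
  minval3 = c(0, .5, .2),
  maxval3 = c(1, .9, .8),
  correspondingval = c(.1, .2, .3)
)

# Non-equi join: for each row of mydata (i), keep every mycomparison row (x)
# whose three ranges all contain the corresponding value.
# In on=, the left-hand name is a column of x and the right-hand name a column of i.
matches <- mycomparison[mydata,
  on = .(minval1 <= val1, maxval1 >= val1,
         minval2 <= val2, maxval2 >= val2,
         minval3 <= val3, maxval3 >= val3),
  .(id = i.id, Name, correspondingval),
  nomatch = NULL]  # drop mydata rows that match no range

# One row per id: the matching range with the smallest correspondingval
best <- matches[, .SD[which.min(correspondingval)], by = id]

# Left join back onto mydata so that unmatched ids are kept with NA
result <- merge(mydata, best, by = "id", all.x = TRUE)
print(result)
```

With the rounded data, ids 4 and 5 fall outside every val2 range, so they come back with NA for Name and correspondingval. Swap which.min for which.max if you want the largest correspondingval instead, or skip the group-by step and use matches directly if you want every matching range per id.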