I'm trying to finding the nearest value for each treated observations. The data look as follows (a partial data from 1.2M obs):
> dta
id treatment score
1: 5 0 0.02381024
2: 10 0 0.05428605
3: 22 0 0.02118124
4: 27 0 0.01495214
5: 45 0 0.01877916
6: 50 0 0.02120360
7: 58 0 0.02207263
8: 60 0 0.02807019
9: 61 0 0.05432927
10: 65 1 0.59612077
11: 68 0 0.02482168
12: 72 1 0.14582400
13: 73 0 0.02371670
14: 77 0 0.02608826
15: 87 0 0.06852409
16: 88 0 0.07473471
17: 94 0 0.07160314
18: 97 0 0.02040747
19: 104 1 0.09878789
20: 108 0 0.02421807
For each treated observations (i.e., treatment = 1) I'd like to get an untreated observation (i.e., treatment = 0) with the nearest score and mark the chosen observation as unavaiable for other treated observations to match.
For example, the first treated observation (row 10) will matched to id = 88 (row 16), row 12 to row 17, and so on. Currently I'm running the floowing loop:
smpl_treated = dta[treatment == 1]
smpl_untreated = dta[treatment == 0]
n_tmp = nrow(smpl_treated)
matched_id = matrix(0, n_tmp, 1)
smpl_tmp = smpl_untreated
for (i in 1:nrow(smpl_treated)) {
x = smpl_treated[i]$score
setkey(smpl_tmp, score)
tmp = smpl_tmp[J(x), roll = "nearest"]
matched_id[i] = tmp[[1]]
smpl_tmp = smpl_tmp[id != tmp[[1]]]
}
matched_smpl = smpl_untreated[id %in% matched_id]
> matched_smpl
id treatment score
1: 87 0 0.06852409
2: 94 0 0.07160314
3: 88 0 0.07473471
Any suggestions to make this happen within a data.table or make the loop faster? With the original 1.2M obs the loop takes over 2 hours. Thanks for your help in advance!
This elaborates the already accepted answer of denis using the actual possibilities of data.table
syntax, e.g., use the on
parameter instead of setkey()
when joining.
# determine the minimum number of treated and untreated cases
n <- min(dta[treatment == 0L, .N], dta[treatment == 1L, .N])
# order by descending score
mdt <- dta[order(-score)][
# and pick the ids of the top n treated and untreated cases
# so that the highest untreated score match the highest treated score,
# the 2nd highest untreated the 2nd highest treated and so forth
, .(id0 = head(.SD[treatment == 0L, id], n), id1 = head(.SD[treatment == 1L, id], n))]
mdt
id0 id1 1: 88 65 2: 94 72 3: 87 104
# join the ids two times to show the data of the treated and untreated cases
dta[dta[mdt, on = .(id==id0)], on = .(id = id1)]
id treatment score i.id i.treatment i.score 1: 65 1 0.59612077 88 0 0.07473471 2: 72 1 0.14582400 94 0 0.07160314 3: 104 1 0.09878789 87 0 0.06852409
I may have a solution if you order your data table, make a subset and use the power of merging. Not sure it is the best solution, but it seems to work for what I understood you want to do, and it will be for sure faster than your loop:
library(data.table)
dta <- data.table(id = c(5,10,22,27,45,50,58,60,61,65,68,72,73,77,87,88,94,97,104,108),
treatment = c(0, 0 ,0 ,0, 0, 0, 0 ,0 , 0 , 1, 0 ,1 ,0, 0 ,0 ,0 ,0 ,0 ,1 ,0),
score = c(0.02381024, 0.05428605, 0.02118124, 0.01495214, 0.01877916, 0.02120360,
0.02207263, 0.02807019, 0.05432927, 0.59612077, 0.02482168, 0.14582400,
0.02371670, 0.02608826, 0.06852409, 0.07473471, 0.07160314, 0.02040747,
0.09878789, 0.02421807))
setkey(dta, score) # order by score
treated_nbr <- dta[treatment == 1, .N] # just to simplify the next line
selecteddata <-
dta[treatment == 0,
.SD[(.N - treated_nbr + 1):.N,
.(correspid = id,
correspscore = score,
id = dta[treatment == 1, id])]]
here we take the same number of ordered non treated person (.N-treated_nbr+1):.N
) so that they have the closest score to the ordered one, and we merge the id to the id of the treated one (id = dta[,.SD[treatment == 1,id]]
)
setkey(selecteddata, id)
setkey(dta, id)
selecteddata[dta] # do the merging
Not sure it is exactly what you want, because I realized it works only if your treated scores are higher than the not treated ones (which is the case in your example). You could add a condition to use the solution proposed only for treated person with score higher than the non treated ones, and do the rest otherwise (I don't see a direct simple solution otherwise)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With