I have a data.table listing intervals during which patients were exposed and the start and stop times during which they were observed for exposures. Exposure happens over intervals during observation. I want to generate the opposite intervals in which the patients were unexposed.
The data I have are formatted like the following:
library(data.table)
DT = fread("
id t0 t s tn
1 1 2 4 15
1 1 6 7 15
1 1 10 12 15
2 4 5 7 44
2 4 9 15 44
2 4 17 35 44")
t0 is the start time of observation, t is the start of exposure, s is the end of exposure, and tn is the end of observation. An example might be exposure to extreme UV in workers. ID 1 begins work in the first month of 2000, then in the second month is exposed to extreme UV in normal working conditions, and this stops in the fourth month as cloud cover persisted for 2 months. The "s" ending time represents the clouds in the fourth month. They persist for 2 months in "safe" conditions until the 6th month when UV rises again for another 2 months. And so on...
I want to generate the intervals in which the exposure did not happen based on these data. That is, when the participant is in the "safe" conditions. The example output would be:
id t s
1 1 2
1 4 6
1 7 10
1 12 15
2 4 5
2 7 9
2 15 17
2 35 44
The first step I have is to set a data.table in the following manner:
nonexp <- DT[, .(t=c(t0[1], s), s=c(t, tn[1])), by=id]
but in my dataset of over 140,000 events, this is running VERY slow. I am working in a remote compute environment that is heavily moderated, so I cannot tell whether the system is running slowly or my code is bad.
Is this code obviously suboptimal in some important ways? Is there a faster way to do this?
I'd store the data so that time variables are not split over multiple columns:
# bounds table
bdDT = melt(unique(DT[, .(id, t0, tn)]), id = "id", value.name = "t")
bdDT[variable == "t0", status := "safe"]
bdDT[variable == "tn", status := "end"]
bdDT[, variable := NULL ]
# core table
treatDT = melt(DT, id="id", value.name = "t", meas = c("t", "s"))
treatDT[variable == "t", status := "treated"]
treatDT[variable == "s", status := "safe"]
treatDT[, variable := NULL ]
# stack
res = unique(rbind(treatDT, bdDT), by=c("id", "t"))
setkey(res, id, t)
The data now looks like
id t status
1: 1 1 safe
2: 1 2 treated
3: 1 4 safe
4: 1 6 treated
5: 1 7 safe
6: 1 10 treated
7: 1 12 safe
8: 1 15 end
9: 2 4 safe
10: 2 5 treated
11: 2 7 safe
12: 2 9 treated
13: 2 15 safe
14: 2 17 treated
15: 2 35 safe
16: 2 44 end
From here, if you want to browse safe spells, there's...
> res[status == "safe"][res[status != "safe"], on=.(id, t), roll=TRUE,
.(id, start = x.t, end = i.t)
]
id start end
1: 1 1 2
2: 1 4 6
3: 1 7 10
4: 1 12 15
5: 2 4 5
6: 2 7 9
7: 2 15 17
8: 2 35 44
Note: if some treatment goes all the way to tn, it will not show up here as a zero-length spell.
Alternately, if you have enough RAM, a cleaner way is to expand the data...
idDT = unique(DT[, .(id, start = t0, end = tn)], by="id")
fullDT = idDT[, .(t = start:end), by=id]
fullDT[, status := "safe"]
fullDT[DT, on=.(id, t >= t, t < s), status := "treated"]
From there, you can collapse to spells for easier browsing
fullDT[,
.(start = first(t), end = last(t))
, by=.(id, status, g = rleid(id, status))][, !"g"][,
end := replace(end + 1L, .N, last(end))
, by=id][]
id status start end
1: 1 safe 1 2
2: 1 treated 2 4
3: 1 safe 4 6
4: 1 treated 6 7
5: 1 safe 7 10
6: 1 treated 10 12
7: 1 safe 12 15
8: 2 safe 4 5
9: 2 treated 5 7
10: 2 safe 7 9
11: 2 treated 9 15
12: 2 safe 15 17
13: 2 treated 17 35
14: 2 safe 35 44
The replace step is needed because the OP writes overlapping end and start dates.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With