I have a large, wide data.table
(20m rows) keyed by a person ID but with lots of columns (~150) that have lots of null values. Each column is a recorded state / attribute that I wish to carry forward for each person. Each person may have anywhere from 10 to 10,000 observations and there are about 500,000 people in the set. Values from one person can not 'bleed' into the following person, so my solution must respect the person ID column and group appropriately.
For demonstration purposes - here's a very small sample input:
DT = data.table(
id=c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
aa=c("A", NA, "B", "C", NA, NA, "D", "E", "F", NA, NA, NA),
bb=c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
cc=c(1, NA, NA, NA, NA, 4, NA, 5, 6, NA, 7, NA)
)
It looks like this:
id aa bb cc
1: 1 A NA 1
2: 1 NA NA NA
3: 1 B NA NA
4: 1 C NA NA
5: 2 NA NA NA
6: 2 NA NA 4
7: 2 D NA NA
8: 2 E NA 5
9: 3 F NA 6
10: 3 NA NA NA
11: 3 NA NA 7
12: 3 NA NA NA
My expected output looks like this:
id aa bb cc
1: 1 A NA 1
2: 1 A NA 1
3: 1 B NA 1
4: 1 C NA 1
5: 2 NA NA NA
6: 2 NA NA 4
7: 2 D NA 4
8: 2 E NA 5
9: 3 F NA 6
10: 3 F NA 6
11: 3 F NA 7
12: 3 F NA 7
I've found a data.table
solution that works, but it's terribly slow on my large data sets:
DT[, na.locf(.SD, na.rm=FALSE), by=id]
I've found equivalent solutions using dplyr that are equally slow.
GRP = DT %>% group_by(id)
data.table(GRP %>% mutate_each(funs(blah=na.locf(., na.rm=FALSE))))
I was hopeful that I could come up with a rolling 'self' join using the data.table
functionality, but I just can't seem to get it right (I suspect I would need to use .N
but I just haven't figured it out).
At this point I'm thinking I'll have to write something in Rcpp to efficiently apply the grouped locf.
I'm new to R, but I'm not new to C++ - so I'm confident I can do it. I just feel like there should be an efficient way to do this in R using data.table
.
A very simple na.locf
can be built by forwarding (cummax
) the non-NA
indices ((!is.na(x)) * seq_along(x)
) and subsetting accordingly:
x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax((!is.na(x)) * seq_along(x))]
# [1] 1 1 1 6 4 5 4 4 4 2
This replicates na.locf
with an na.rm = TRUE
argument, to get na.rm = FALSE
behavior we simply need to make sure the first element in the cummax
is TRUE
:
x = c(NA, NA, 1, NA, 2)
x[cummax(c(TRUE, tail((!is.na(x)) * seq_along(x), -1)))]
#[1] NA NA 1 1 2
In this case, we need to take into account not only the non-NA
indices but, also, of the indices where the (ordered, or to be ordered) "id" column changes value:
id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
c(TRUE, id[-1] != id[-length(id)])
# [1] TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
Combining the above:
id = c(10, 10, 11, 11, 11, 12, 12, 12, 13, 13)
x = c(1, NA, NA, 6, 4, 5, 4, NA, NA, 2)
x[cummax(((!is.na(x)) | c(TRUE, id[-1] != id[-length(id)])) * seq_along(x))]
# [1] 1 1 NA 6 4 5 4 4 NA 2
Note, that here we OR
the first element with TRUE
, i.e. make it equal to TRUE
, thus getting the na.rm = FALSE
behavior.
And for this example:
id_change = DT[, c(TRUE, id[-1] != id[-.N])]
DT[, lapply(.SD, function(x) x[cummax(((!is.na(x)) | id_change) * .I)])]
# id aa bb cc
# 1: 1 A NA 1
# 2: 1 A NA 1
# 3: 1 B NA 1
# 4: 1 C NA 1
# 5: 2 NA NA NA
# 6: 2 NA NA 4
# 7: 2 D NA 4
# 8: 2 E NA 5
# 9: 3 F NA 6
#10: 3 F NA 6
#11: 3 F NA 7
#12: 3 F NA 7
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With