I'd like to fill missing values with a "row distance" to the nearest non-NA value. In other words, how would I convert column x in this sample dataframe into column y?
#    x y
#1   0 0
#2  NA 1
#3   0 0
#4  NA 1
#5  NA 2
#6  NA 1
#7   0 0
#8  NA 1
#9  NA 2
#10 NA 3
#11 NA 2
#12 NA 1
#13  0 0
I can't seem to find the right combination of dplyr group_by and mutate row_number() statements to do the trick. The various imputation packages that I've investigated are designed for more complicated scenarios where imputation is performed using statistics and other variables.
d <- data.frame(x = c(0, NA, 0, rep(NA, 3), 0, rep(NA, 5), 0), y = c(0, 1, 0, 1, 2, 1, 0, 1, 2, 3, 2, 1, 0))
We can compute this directly, taking for each row index the minimum absolute distance to the positions of the non-NA values:
d$z <- sapply(seq_along(d$x), function(i) min(abs(i - which(!is.na(d$x)))))
#     x y z
#  1  0 0 0
#  2 NA 1 1
#  3  0 0 0
#  4 NA 1 1
#  5 NA 2 2
#  6 NA 1 1
#  7  0 0 0
#  8 NA 1 1
#  9 NA 2 2
# 10 NA 3 3
# 11 NA 2 2
# 12 NA 1 1
# 13  0 0 0
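To see what the lambda does, take row 5 as an example:

which(!is.na(d$x))               # positions of the non-NA values: 1 3 7 13
abs(5 - which(!is.na(d$x)))      # distances from row 5: 4 2 2 8
min(abs(5 - which(!is.na(d$x)))) # the nearest non-NA value is 2 rows away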
If you want to do this in dplyr, you can just wrap the sapply part in a mutate:
d %>%
  mutate(z = sapply(seq_along(x), function(i) min(abs(i - which(!is.na(x))))))
or, also using library(purrr) (thanks to @Onyambu):
d %>% mutate(m = map_dbl(1:n(), ~ min(abs(.x - which(!is.na(x))))))
Here is a way using data.table
library(data.table)
setDT(d)
d[, out := pmin(cumsum(is.na(x)), rev(cumsum(is.na(x)))), by = rleid(is.na(x))]
d
#      x y out
#  1:  0 0   0
#  2: NA 1   1
#  3:  0 0   0
#  4: NA 1   1
#  5: NA 2   2
#  6: NA 1   1
#  7:  0 0   0
#  8: NA 1   1
#  9: NA 2   2
# 10: NA 3   3
# 11: NA 2   2
# 12: NA 1   1
# 13:  0 0   0
For each group of NAs we calculate the parallel minimum of cumsum(is.na(x)) and its reverse. That works because the values in the groups of all non-NAs will be 0. Call setDF(d) if you want to continue with a data.frame.
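Concretely, for the run of five NAs in rows 8 to 12:

tmp <- cumsum(rep(TRUE, 5)) # cumsum(is.na(x)) inside the run: 1 2 3 4 5
rev(tmp)                    # the same count from the other end: 5 4 3 2 1
pmin(tmp, rev(tmp))         # distance to the nearest non-NA: 1 2 3 2 1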
Instead of calculating cumsum(is.na(x)) twice, we could also do
d[, out := {
tmp <- cumsum(is.na(x))
pmin(tmp, rev(tmp))
}, by = rleid(is.na(x))]
This might give a performance gain, but I haven't tested it.
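If you want to verify, something along these lines would do (a rough sketch assuming the microbenchmark package is installed; x_big is just a made-up larger input so the difference is measurable):

library(microbenchmark)
x_big <- rep(c(0, rep(NA, 99)), 1000) # hypothetical input: 100,000 rows, mostly NA
dt <- data.table(x = x_big)
microbenchmark(
  twice = dt[, pmin(cumsum(is.na(x)), rev(cumsum(is.na(x)))), by = rleid(is.na(x))],
  once  = dt[, {tmp <- cumsum(is.na(x)); pmin(tmp, rev(tmp))}, by = rleid(is.na(x))],
  times = 10
)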
Using dplyr syntax, the same approach would read
library(dplyr)
d %>%
group_by(grp = data.table::rleid(is.na(x))) %>%
mutate(out = pmin(cumsum(is.na(x)), rev(cumsum(is.na(x))))) %>%
ungroup()
# A tibble: 13 x 4
#        x     y   grp   out
#    <dbl> <dbl> <int> <int>
#  1     0     0     1     0
#  2    NA     1     2     1
#  3     0     0     3     0
#  4    NA     1     4     1
#  5    NA     2     4     2
#  6    NA     1     4     1
#  7     0     0     5     0
#  8    NA     1     6     1
#  9    NA     2     6     2
# 10    NA     3     6     3
# 11    NA     2     6     2
# 12    NA     1     6     1
# 13     0     0     7     0
The same idea in base R
rle_x <- rle(is.na(d$x))
grp <- rep(seq_along(rle_x$lengths), times = rle_x$lengths)
transform(d, out = ave(is.na(x), grp, FUN = function(i) pmin(cumsum(i), rev(cumsum(i)))))
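All of these approaches give the same result; as a quick sanity check against the expected column from the question (here with the sapply version):

z <- sapply(seq_along(d$x), function(i) min(abs(i - which(!is.na(d$x)))))
all(z == d$y)
# [1] TRUE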