I'd like to fill missing values with a "row distance" to the nearest non-NA value. In other words, how would I convert column x in this sample dataframe into column y?
#    x y
#1   0 0
#2  NA 1
#3   0 0
#4  NA 1
#5  NA 2
#6  NA 1
#7   0 0
#8  NA 1
#9  NA 2
#10 NA 3
#11 NA 2
#12 NA 1
#13  0 0
I can't seem to find the right combination of dplyr group_by and mutate row_number() statements to do the trick. The various imputation packages that I've investigated are designed for more complicated scenarios where imputation is performed using statistics and other variables.
d <- data.frame(x = c(0, NA, 0, rep(NA, 3), 0, rep(NA, 5), 0), y = c(0, 1, 0, 1, 2, 1, 0, 1, 2, 3, 2, 1, 0))
We can compute this directly, taking for each row index the minimum absolute distance to the positions of the non-NA values:
d$z <- sapply(seq_along(d$x), function(i) min(abs(i - which(!is.na(d$x)))))
#     x y z
#  1  0 0 0
#  2 NA 1 1
#  3  0 0 0
#  4 NA 1 1
#  5 NA 2 2
#  6 NA 1 1
#  7  0 0 0
#  8 NA 1 1
#  9 NA 2 2
# 10 NA 3 3
# 11 NA 2 2
# 12 NA 1 1
# 13  0 0 0
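To see what the lambda does, take row 5 as an example:

which(!is.na(d$x))               # positions of the non-NA values: 1 3 7 13
abs(5 - which(!is.na(d$x)))      # distances from row 5: 4 2 2 8
min(abs(5 - which(!is.na(d$x)))) # the nearest non-NA value is 2 rows away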
If you want to do this in dplyr, you can just wrap the sapply part in a mutate:
d %>%
  mutate(z = sapply(seq_along(x), function(i) min(abs(i - which(!is.na(x))))))
or, also using library(purrr) (thanks to @Onyambu):
d %>% mutate(m = map_dbl(1:n(), ~ min(abs(.x - which(!is.na(x))))))
Here is a way using data.table
library(data.table)
setDT(d)
d[, out := pmin(cumsum(is.na(x)), rev(cumsum(is.na(x)))), by = rleid(is.na(x))]
d
#      x y out
#  1:  0 0   0
#  2: NA 1   1
#  3:  0 0   0
#  4: NA 1   1
#  5: NA 2   2
#  6: NA 1   1
#  7:  0 0   0
#  8: NA 1   1
#  9: NA 2   2
# 10: NA 3   3
# 11: NA 2   2
# 12: NA 1   1
# 13:  0 0   0
For each group of NAs we calculate the parallel minimum of cumsum(is.na(x)) and its reverse. That works because the values in the groups of all non-NAs will be 0. Call setDF(d) if you want to continue with a data.frame.
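Concretely, for the run of five NAs in rows 8 to 12:

tmp <- cumsum(rep(TRUE, 5)) # cumsum(is.na(x)) inside the run: 1 2 3 4 5
rev(tmp)                    # the same count from the other end: 5 4 3 2 1
pmin(tmp, rev(tmp))         # distance to the nearest non-NA: 1 2 3 2 1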
Instead of calculating cumsum(is.na(x)) twice, we could also do
d[, out := {
tmp <- cumsum(is.na(x))
pmin(tmp, rev(tmp))
}, by = rleid(is.na(x))]
This might give a performance gain, but I haven't tested it.
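If you want to verify, something along these lines would do (a rough sketch assuming the microbenchmark package is installed; x_big is just a made-up larger input so the difference is measurable):

library(microbenchmark)
x_big <- rep(c(0, rep(NA, 99)), 1000) # hypothetical input: 100,000 rows, mostly NA
dt <- data.table(x = x_big)
microbenchmark(
  twice = dt[, pmin(cumsum(is.na(x)), rev(cumsum(is.na(x)))), by = rleid(is.na(x))],
  once  = dt[, {tmp <- cumsum(is.na(x)); pmin(tmp, rev(tmp))}, by = rleid(is.na(x))],
  times = 10
)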
Using dplyr syntax, the same approach would read
library(dplyr)
d %>%
group_by(grp = data.table::rleid(is.na(x))) %>%
mutate(out = pmin(cumsum(is.na(x)), rev(cumsum(is.na(x))))) %>%
ungroup()
# A tibble: 13 x 4
#        x     y   grp   out
#    <dbl> <dbl> <int> <int>
#  1     0     0     1     0
#  2    NA     1     2     1
#  3     0     0     3     0
#  4    NA     1     4     1
#  5    NA     2     4     2
#  6    NA     1     4     1
#  7     0     0     5     0
#  8    NA     1     6     1
#  9    NA     2     6     2
# 10    NA     3     6     3
# 11    NA     2     6     2
# 12    NA     1     6     1
# 13     0     0     7     0
The same idea in base R
rle_x <- rle(is.na(d$x))
grp <- rep(seq_along(rle_x$lengths), times = rle_x$lengths)
transform(d, out = ave(is.na(x), grp, FUN = function(i) pmin(cumsum(i), rev(cumsum(i)))))
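All of these approaches give the same result; as a quick sanity check against the expected column from the question (here with the sapply version):

z <- sapply(seq_along(d$x), function(i) min(abs(i - which(!is.na(d$x)))))
all(z == d$y)
# [1] TRUE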