
Replacing NAs in R with nearest value

Tags: r, missing-data, na

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:

dat <- c(1, 3, NA, NA, 5, 7) 

Replacing NA with na.locf (3 is carried forward):

library(zoo)
na.locf(dat)
# 1 3 3 3 5 7

and na.locf with fromLast set to TRUE (5 is carried backwards):

na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7

But I want the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:

1 3 3 5 5 7 

I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?

FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:

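f0 <- function(dat) {
  na.pos <- which(is.na(dat))
  # Nothing to fill (no NAs) or nothing to fill from (all NAs)
  if (length(na.pos) %in% c(0, length(dat))) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  # For each NA, index (into non.na.pos) of the closest non-NA position;
  # which.min() picks the first minimum, so ties go to the left
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    return(which.min(abs(non.na.pos - x)))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  return(dat)
}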

To answer smci's questions below:

  1. No, any entry can be NA
  2. If all are NA, leave them as is
  3. No. My current solution defaults to the lefthand nearest value, but it doesn't matter
  4. These rows are a few hundred thousand elements typically, so in theory the upper bound would be a few hundred thousand. In reality it'd be no more than a few here & there, typically a single one.
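Points 2 and 3 above can be written down as small checks. The following is only a sketch: `fill_nearest` is a hypothetical name for a brute-force reference, not something provided by zoo or base R, and it assumes ties should go to the left-hand neighbour.

```r
# Hypothetical brute-force reference, for illustration only
fill_nearest <- function(x) {
  na.pos     <- which(is.na(x))
  non.na.pos <- which(!is.na(x))
  # Nothing to fill, or nothing to fill from (all NA): leave as is
  if (length(na.pos) == 0 || length(non.na.pos) == 0) {
    return(x)
  }
  nearest <- vapply(na.pos, function(i) {
    # which.min() returns the first minimum, so a tie goes to the left neighbour
    non.na.pos[which.min(abs(non.na.pos - i))]
  }, integer(1))
  x[na.pos] <- x[nearest]
  x
}

fill_nearest(c(NA, NA, NA))            # all NA: returned unchanged
fill_nearest(c(1, NA, 5))              # tie: 1 1 5
fill_nearest(c(NA, 3, NA, NA, 5, NA))  # 3 3 3 5 5 5
```

Leading and trailing NAs have only one candidate neighbour, so they simply take the first or last non-NA value.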

Update: So it turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks all!

asked Apr 09 '12 17:04 by geoffjentry





1 Answer

Here is a very fast one. It uses findInterval to find which two positions (the nearest non-NA value on each side) should be considered for each NA in your original data:

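f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  # Nothing to fill (no NAs) or nothing to fill from (all NAs)
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  # Index into non.na.pos of the non-NA position to the left of each NA
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  # Bound by the length of non.na.pos (not N) so the index never runs past it
  right.pos  <- non.na.pos[pmin(length(non.na.pos), intervals + 1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  # Ties (equal distances) go to the left-hand value
  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}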
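To see what findInterval contributes, here is a step-by-step sketch on the example vector from the question. It only traces the intermediate values; the variable names mirror the function above, and the commented results assume that exact input.

```r
dat <- c(1, 3, NA, NA, 5, 7)

na.pos     <- which(is.na(dat))    # 3 4
non.na.pos <- which(!is.na(dat))   # 1 2 5 6

# For each NA position, the index (into non.na.pos) of the last
# non-NA position at or before it
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)  # 2 2

left.pos  <- non.na.pos[intervals]       # 2 2  (the value 3)
right.pos <- non.na.pos[intervals + 1]   # 5 5  (the value 5)

left.dist  <- na.pos - left.pos          # 1 2
right.dist <- right.pos - na.pos         # 2 1

# Take whichever neighbour is closer; ties go to the left
dat[na.pos] <- ifelse(left.dist <= right.dist,
                      dat[left.pos], dat[right.pos])
dat
# 1 3 3 5 5 7
```

Because findInterval works on the whole vector of NA positions at once, there is no per-NA search through all non-NA positions, which is where the speed-up over an sapply/which.min loop comes from.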

And here I test it:

# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

# computation times
system.time(r0 <- f0(dat))    # your function
# user  system elapsed
# 5.52    0.00    5.52
system.time(r1 <- f1(dat))    # this function
# user  system elapsed
# 0.01    0.00    0.03
identical(r0, r1)
# [1] TRUE
answered Sep 20 '22 14:09 by flodel
