Conditionally replace missing values depending on surrounding non-missing values

Question

I am trying to replace missing values (NA) in a vector. An NA between two equal numbers should be replaced by that number; an NA between two different values should stay NA. For example, given vector "a", I want to get "b".

a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
b = c(1, 1, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, 3, 3, 3, 3)

As you can see, the second run of NA, between the values 1 and 2, is not replaced.

Is there a way to vectorize the calculation?

David Arenburg · Accepted Answer

The OP asked for a vectorized solution, so here is a possible vectorized base R solution (without for loops) that also handles leading and trailing NAs:

# Define a vector with Leading/Lagging NAs
a <- c(NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA, NA)

# Save the boolean vector as we are going to reuse it a lot
na_vals <- is.na(a)

# For each NA, find the position (among the non-NAs) of the nearest non-NA to its left
ind <- findInterval(which(na_vals), which(!na_vals))

# Find the positions where two consecutive non-NA values are equal
ind2 <- which(!diff(a[!na_vals]))

# Fill only the NAs that lie between equal consecutive values
a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
a
# [1] NA NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA NA
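To make the indexing easier to follow, here is a minimal sketch that wraps the same steps in a function and prints one intermediate result on a tiny vector. The function name fill_between_equal and the variable non_na are my own labels for illustration, not part of the answer above, and diff(non_na) == 0 is just a more explicit spelling of !diff(...).

```r
# Hypothetical wrapper around the steps shown above (name is illustrative only)
fill_between_equal <- function(x) {
  na_vals <- is.na(x)
  non_na  <- x[!na_vals]

  # For each NA: index, among the non-NA values, of the nearest non-NA
  # to its left (0 for leading NAs)
  ind <- findInterval(which(na_vals), which(!na_vals))

  # Indices i for which non_na[i] equals non_na[i + 1]
  ind2 <- which(diff(non_na) == 0)

  # match() returns NA for every NA that is not enclosed by an equal pair,
  # so those positions are "filled" with NA again, i.e. left unchanged
  x[na_vals] <- non_na[ind2[match(ind, ind2)]]
  x
}

x <- c(NA, 1, NA, 1, NA, 2, NA)

findInterval(which(is.na(x)), which(!is.na(x)))
# [1] 0 1 2 3

fill_between_equal(x)
# [1] NA  1  1  1 NA  2 NA
```

Only the NA whose left neighbour has index 1 is filled, because index 1 is the only position where two consecutive non-NA values (1 and 1) are equal.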

Some time comparisons for big vectors

# Create a big vector
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)

############################################
##### Cainã Max Couto-Silva

fill_data <- function(vec) {

  for(l in unique(vec[!is.na(vec)])) {

    g <- which(vec %in% l)

    indexes <- list()

    # seq_len() yields an empty sequence when l occurs only once (1:0 would not)
    for(i in seq_len(length(g) - 1)) {
      indexes[[i]] <- (g[i]+1):(g[i+1]-1)
    }

    for(i in seq_len(length(g) - 1)) {
      if(all(is.na(vec[indexes[[i]]]))) {
        vec[indexes[[i]]] <- l
      }
    }
  }

  return(vec)
}

system.time(res <- fill_data(a))
#   user  system elapsed 
#  81.73    4.41   86.48 

############################################
##### Henrik

library(zoo)  # provides na.approx() and na.locf()

system.time({
  a_ap <- na.approx(a, na.rm = FALSE)
  a_locf <- na.locf(a, na.rm = FALSE)
  a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
})
#  user  system elapsed 
# 12.55    3.39   15.98 

# Validate
identical(res, as.integer(a))
# [1] TRUE

############################################
##### David

## Recreate a, as it has been overwritten
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)

system.time({
  # Save the boolean vector as we are going to reuse it a lot
  na_vals <- is.na(a)

  # For each NA, find the position (among the non-NAs) of the nearest non-NA to its left
  ind <- findInterval(which(na_vals), which(!na_vals))

  # Find the positions where two consecutive non-NA values are equal
  ind2 <- which(!diff(a[!na_vals]))

  # Fill only the NAs that lie between equal consecutive values
  a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
})
# user  system elapsed 
# 3.39    0.71    4.13 

# Validate
identical(res, a)
# [1] TRUE

Henrik · Answer

You may use convenience functions from the zoo package. Here we replace NA in the original vector wherever the interpolated values (created by na.approx) equal the 'last observation carried forward' values (created by na.locf):

library(zoo)
a_ap <- na.approx(a)
a_locf <- na.locf(a)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1]  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3

To account for leading and trailing NA, add na.rm = FALSE:

a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)

a_ap <- na.approx(a, na.rm = FALSE)
a_locf <- na.locf(a, na.rm = FALSE)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1] NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA
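The comparison works because linear interpolation between two equal endpoints is constant, so it coincides with the carried-forward value only inside such runs. If zoo is not available, a similar idea can be sketched in base R by comparing "last observation carried forward" with "next observation carried backward". This is a swapped-in variant, not the zoo approach itself, and locf_base and fill_equal_neighbours are hypothetical helper names.

```r
# Base R stand-in for carrying the last non-NA value forward
# (hypothetical helper, not a zoo function)
locf_base <- function(x) {
  # Index of the most recent non-NA element at each position (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA   # leading NAs have nothing to carry forward
  x[idx]
}

fill_equal_neighbours <- function(x) {
  fwd <- locf_base(x)             # nearest non-NA to the left
  bwd <- rev(locf_base(rev(x)))   # nearest non-NA to the right

  # Fill only where both neighbours exist and agree
  same <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[same] <- fwd[same]
  x
}

a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)
fill_equal_neighbours(a)
# [1] NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA
```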

Cainã Max Couto-Silva · Answer

You can write a function like this:

fill_data <- function(vec) {

  for(l in unique(vec[!is.na(vec)])) {

    g <- which(vec %in% l)

    indexes <- list()

    # seq_len() yields an empty sequence when l occurs only once (1:0 would not)
    for(i in seq_len(length(g) - 1)) {
      indexes[[i]] <- (g[i]+1):(g[i+1]-1)
    }

    for(i in seq_len(length(g) - 1)) {
      if(all(is.na(vec[indexes[[i]]]))) {
        vec[indexes[[i]]] <- l
      }
    }
  }

  return(vec)
}

Running the function:

a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)

fill_data(a)
[1]  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3

It also works when the same value recurs in several separate places in the vector:

ab = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA , 1, NA, 1, 3, NA, NA, 3)

fill_data(ab)
[1]  1  1  1  1  1  1  1  1  1 NA  2  2  2  2  2 NA  1  1  1  3  3  3  3

Explanation:

First, the function finds the unique non-NA values.

Then, for each unique value, it takes the indexes where that value occurs and collects the positions between each pair of consecutive occurrences.

Finally, it tests whether the values at those positions are all NA and, if they are, replaces them with that unique value.
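As a rough illustration of those steps, here is what the intermediate values look like for the value 1 in the ab vector from above. The name gap_is_all_na is made up for this sketch; the function itself stores the gaps in the indexes list instead.

```r
ab <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA, 1, NA, 1, 3, NA, NA, 3)

# Positions where the value 1 occurs
g <- which(ab %in% 1)
g
# [1]  1  5  9 17 19

# For each pair of consecutive occurrences, check whether everything
# strictly between them is NA
gap_is_all_na <- vapply(
  seq_len(length(g) - 1),
  function(i) all(is.na(ab[(g[i] + 1):(g[i + 1] - 1)])),
  logical(1)
)
gap_is_all_na
# [1]  TRUE  TRUE FALSE  TRUE
```

The third gap (positions 10 to 16) contains the 2s, so it is left untouched, which matches the two NAs that remain around the run of 2s in the output above.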

Tags: r, vectorization, missing-data, na

Ziyan Xu
