
Conditionally replace missing values depending on surrounding non-missing values

I am trying to replace missing values (NA) in a vector. A run of NA between two equal numbers should be replaced by that number. A run of NA between two different values should stay NA. For example, given vector "a", I want it to become "b".

a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
b = c(1, 1, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, 3, 3, 3, 3)

As you can see, the second run of NA, between the values 1 and 2, is not replaced.

Is there a way to vectorize the calculation?

Ziyan Xu asked Apr 06 '18 15:04


3 Answers

OP asked for a vectorized solution, so here's a possible vectorized base R solution (without for loops) that also handles leading/trailing NAs:

# Define a vector with leading/trailing NAs
a <- c(NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA, NA)

# Save the logical vector, as we are going to reuse it a lot
na_vals <- is.na(a)

# For each NA, find the index of the nearest non-NA value to its left
ind <- findInterval(which(na_vals), which(!na_vals))

# Find the non-NA values that equal the next non-NA value
ind2 <- which(!diff(a[!na_vals]))

# Fill only the NAs that lie between two equal consecutive non-NA values
a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
a
# [1] NA NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA NA
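
To see what each step contributes, here is a small trace on the vector from the question. It is only an illustrative sketch; the names known_pos, known_val, keep and filled are introduced here and are not part of the code above.

```r
a <- c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)

na_vals   <- is.na(a)
known_pos <- which(!na_vals)   # positions of the non-NA values
known_val <- a[!na_vals]       # the non-NA values themselves

# For every NA: index (into known_pos) of the closest non-NA value on its left
ind <- findInterval(which(na_vals), known_pos)

# Indices k where known_val[k] equals known_val[k + 1]
ind2 <- which(!diff(known_val))

# Keep ind only where it also appears in ind2; everything else becomes NA
keep <- ind2[match(ind, ind2)]

filled <- a
filled[na_vals] <- known_val[keep]

# One column per NA in the input
print(rbind(na_position = which(na_vals), ind = ind, keep = keep))
print(filled)
```

An NA is filled only when the non-NA value on its left equals the next non-NA value, which is exactly the case where its left and right neighbours match.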

Some time comparisons for big vectors

# Create a big vector
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)

############################################
##### Cainã Max Couto-Silva

fill_data <- function(vec) {

  for(l in unique(vec[!is.na(vec)])) {

    g <- which(vec %in% l)

    indexes <- list()

    for(i in 1:(length(g) - 1)) {
      indexes[[i]] <- (g[i]+1):(g[i+1]-1)
    }

    for(i in 1:(length(g) - 1)) { 
      if(all(is.na(vec[indexes[[i]]]))) {
        vec[indexes[[i]]] <- l
      }
    }
  }

  return(vec)
}

system.time(res <- fill_data(a))
#   user  system elapsed 
#  81.73    4.41   86.48 

############################################
##### Henrik

library(zoo)  # provides na.approx() and na.locf()

system.time({
  a_ap <- na.approx(a, na.rm = FALSE)
  a_locf <- na.locf(a, na.rm = FALSE)
  a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
})
#  user  system elapsed 
# 12.55    3.39   15.98 

# Validate
identical(res, as.integer(a))
# [1] TRUE

############################################
##### David

## Recreate a, as it has been overwritten
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)

system.time({
  # Save the logical vector, as we are going to reuse it a lot
  na_vals <- is.na(a)

  # For each NA, find the index of the nearest non-NA value to its left
  ind <- findInterval(which(na_vals), which(!na_vals))

  # Find the non-NA values that equal the next non-NA value
  ind2 <- which(!diff(a[!na_vals]))

  # Fill only the NAs that lie between two equal consecutive non-NA values
  a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
})
# user  system elapsed 
# 3.39    0.71    4.13 

# Validate
identical(res, a)
# [1] TRUE
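
The benchmark above needs a 5e7-element vector and the zoo package. For a quicker sanity check, the sketch below wraps the vectorized steps in a function and compares it with a deliberately simple loop on many small random vectors. The names fill_vectorized and fill_reference are hypothetical, and the loop is a plain reference written for this check, not one of the answers benchmarked above.

```r
# The vectorized steps, wrapped in a function (name is hypothetical)
fill_vectorized <- function(a) {
  na_vals <- is.na(a)
  ind  <- findInterval(which(na_vals), which(!na_vals))
  ind2 <- which(!diff(a[!na_vals]))
  a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
  a
}

# Simple reference: walk over consecutive pairs of non-NA positions
fill_reference <- function(a) {
  known <- which(!is.na(a))
  for (k in seq_len(max(length(known) - 1, 0))) {
    lo <- known[k]
    hi <- known[k + 1]
    # Fill only if there is a gap and both ends hold the same value
    if (hi - lo > 1 && a[lo] == a[hi]) {
      a[(lo + 1):(hi - 1)] <- a[lo]
    }
  }
  a
}

# Compare both implementations on 200 small random vectors
set.seed(42)
for (trial in 1:200) {
  x <- sample(c(NA, 1:3), 30, replace = TRUE)
  stopifnot(identical(fill_vectorized(x), fill_reference(x)))
}
```

If the loop finishes without an error, the two functions agreed on every trial.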

David Arenburg answered Nov 15 '22 08:11


You may use convenience functions from the zoo package. Here we replace NA in the original vector wherever the interpolated value (created by na.approx) equals the last observation carried forward (created by na.locf):

library(zoo)
a_ap <- na.approx(a)
a_locf <- na.locf(a)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1]  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3

To account for leading and trailing NA, add na.rm = FALSE:

a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)

a_ap <- na.approx(a, na.rm = FALSE)
a_locf <- na.locf(a, na.rm = FALSE)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1] NA  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3 NA
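
If zoo is not available, a similar idea can be sketched in base R. Note that this is a variation, not the code above: instead of comparing an interpolation with the last observation carried forward, it compares the last observation carried forward with the next observation carried backward and fills where the two agree. The helper names locf, nocb and fill_if_equal are hypothetical.

```r
# Last observation carried forward, base R only
locf <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of the latest non-NA so far
  idx[idx == 0] <- NA                        # nothing observed yet: leading NAs stay NA
  x[idx]
}

# Next observation carried backward: locf() applied to the reversed vector
nocb <- function(x) rev(locf(rev(x)))

fill_if_equal <- function(x) {
  fwd <- locf(x)
  bwd <- nocb(x)
  # which() drops positions where the comparison is NA (leading/trailing runs)
  fill <- which(is.na(x) & fwd == bwd)
  x[fill] <- fwd[fill]
  x
}

a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)
result <- fill_if_equal(a)
print(result)
```

Leading and trailing NAs are left alone because one of the two carried vectors is NA there, so the comparison never evaluates to TRUE.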

Henrik answered Nov 15 '22 09:11


You can write a function like this:

fill_data <- function(vec) {

  for(l in unique(vec[!is.na(vec)])) {

    g <- which(vec %in% l)

    indexes <- list()

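    # seq_len() gives an empty loop when the value occurs only once;
    # 1:(length(g) - 1) would yield c(1, 0) and fail on g[2]
    for(i in seq_len(length(g) - 1)) {
      indexes[[i]] <- (g[i]+1):(g[i+1]-1)
    }

    for(i in seq_len(length(g) - 1)) {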
      if(all(is.na(vec[indexes[[i]]]))) {
        vec[indexes[[i]]] <- l
      }
    }
  }

  return(vec)
}

Running the function:

a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)

fill_data(a)
[1]  1  1  1  1  1 NA NA NA  2  2  2  2  3  3  3  3

It also works when the same value recurs in several separate places in the vector:

ab = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA , 1, NA, 1, 3, NA, NA, 3)

fill_data(ab)
[1]  1  1  1  1  1  1  1  1  1 NA  2  2  2  2  2 NA  1  1  1  3  3  3  3

Explanation:

First, the function finds the unique non-NA values.

Then, for each of these values, it takes the positions where the value occurs and collects the indexes lying between each pair of consecutive occurrences.

Finally, it tests whether the elements at those indexes are all NA and, if they are, replaces them with that value.
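
As an illustration of these steps, the sketch below traces a single value (1) through the vector ab from the example. The names starts, ends, all_na and gaps are introduced only for this trace and do not appear in fill_data.

```r
ab <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA, 1, NA, 1, 3, NA, NA, 3)

# Positions where the value 1 occurs
g <- which(ab %in% 1)

# Index range strictly between each pair of consecutive occurrences
starts <- head(g, -1) + 1
ends   <- tail(g, -1) - 1

# A gap qualifies only if it is non-empty and contains nothing but NA
all_na <- mapply(function(s, e) s <= e && all(is.na(ab[s:e])), starts, ends)

gaps <- data.frame(start = starts, end = ends, all_na = all_na)
print(gaps)
```

The gap from 10 to 16 contains the value 2, so it is not filled with 1; the other three gaps hold only NA and are filled.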

Cainã Max Couto-Silva answered Nov 15 '22 09:11