I am trying to replace missing values (NA) in a vector. An NA between two equal numbers should be replaced by that number. An NA between two different values should stay NA. For example, given vector "a", I want it to be "b".
a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
b = c(1, 1, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, 3, 3, 3, 3)
As you can see, the second run of NA, between the values 1 and 2, is not replaced.
Is there a way to vectorize the calculation?
The OP asked for a vectorized solution, so here is a possible vectorized base R solution (without for loops) that also handles leading/trailing NAs:
# Define a vector with leading/trailing NAs
a <- c(NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA, NA)
# Save the boolean vector as we are going to reuse it a lot
na_vals <- is.na(a)
# Locate each NA relative to the non-NA values (index of the nearest non-NA to its left)
ind <- findInterval(which(na_vals), which(!na_vals))
# Find the consecutive non-NA values that are equal
ind2 <- which(!diff(a[!na_vals]))
# Fill only the NAs that lie between equal consecutive values
a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
a
# [1] NA NA 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3 NA NA
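To see why the indexing works, it may help to look at the intermediate vectors on the small example. The sketch below wraps the same steps in a function; the name fill_between_equal is only an illustrative choice, and diff(...) == 0 is used in place of !diff(...) purely for readability.

```r
# Hypothetical wrapper around the steps above; the function name is illustrative.
fill_between_equal <- function(x) {
  na_vals <- is.na(x)
  # For each NA: index (among the non-NA values) of the nearest non-NA to its left;
  # 0 means the NA comes before every non-NA value
  ind <- findInterval(which(na_vals), which(!na_vals))
  # Positions k (among the non-NA values) where value k equals value k + 1
  ind2 <- which(diff(x[!na_vals]) == 0)
  # match() returns NA for gaps whose neighbours differ, so those NAs stay NA
  x[na_vals] <- x[!na_vals][ind2[match(ind, ind2)]]
  x
}

a <- c(NA, NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA, NA)
findInterval(which(is.na(a)), which(!is.na(a)))
# [1] 0 0 1 1 1 2 2 2 3 3 5 5 6 6
which(diff(a[!is.na(a)]) == 0)
# [1] 1 3 5
res_small <- fill_between_equal(a)
```

Only the gaps numbered 1, 3 and 5 appear in the second vector, so only the NAs belonging to those gaps are filled; gaps 0, 2 and 6 are left as NA.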
Some timing comparisons for big vectors:
# Create a big vector
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)
############################################
##### Cainã Max Couto-Silva
fill_data <- function(vec) {
for(l in unique(vec[!is.na(vec)])) {
g <- which(vec %in% l)
indexes <- list()
for(i in 1:(length(g) - 1)) {
indexes[[i]] <- (g[i]+1):(g[i+1]-1)
}
for(i in 1:(length(g) - 1)) {
if(all(is.na(vec[indexes[[i]]]))) {
vec[indexes[[i]]] <- l
}
}
}
return(vec)
}
system.time(res <- fill_data(a))
# user system elapsed
# 81.73 4.41 86.48
############################################
##### Henrik
library(zoo) # provides na.approx() and na.locf()
system.time({
a_ap <- na.approx(a, na.rm = FALSE)
a_locf <- na.locf(a, na.rm = FALSE)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
})
# user system elapsed
# 12.55 3.39 15.98
# Validate
identical(res, as.integer(a))
# [1] TRUE
############################################
##### David
## Recreate a, as it has been overwritten
set.seed(123)
a <- sample(c(NA, 1:5), 5e7, replace = TRUE)
system.time({
# Save the boolean vector as we are going to reuse it a lot
na_vals <- is.na(a)
# Locate each NA relative to the non-NA values (index of the nearest non-NA to its left)
ind <- findInterval(which(na_vals), which(!na_vals))
# Find the consecutive non-NA values that are equal
ind2 <- which(!diff(a[!na_vals]))
# Fill only the NAs that lie between equal consecutive values
a[na_vals] <- a[!na_vals][ind2[match(ind, ind2)]]
})
# user system elapsed
# 3.39 0.71 4.13
# Validate
identical(res, a)
# [1] TRUE
You may use convenience functions from the zoo package. Here we replace NA in the original vector wherever the interpolated values (created by na.approx) equal the 'last observation carried forward' values (created by na.locf):
library(zoo)
a_ap <- na.approx(a)
a_locf <- na.locf(a)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1] 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3
To account for leading and trailing NA, add na.rm = FALSE:
a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)
a_ap <- na.approx(a, na.rm = FALSE)
a_locf <- na.locf(a, na.rm = FALSE)
a[which(a_ap == a_locf)] <- a_ap[which(a_ap == a_locf)]
a
# [1] NA 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3 NA
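If zoo is not available, a similar idea can be sketched in base R. Note that this swaps in a slightly different comparison: instead of interpolation versus forward fill, it compares a forward fill with a backward fill and fills an NA only where the two agree. The helper names locf and fill_equal_neighbours are hypothetical, not part of zoo or of this answer.

```r
# Hypothetical helper: last observation carried forward, base R only.
locf <- function(x) {
  # For each position, the index of the most recent non-NA element (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Prepend NA so that index 0 maps to NA
  c(NA, x)[idx + 1]
}

# Hypothetical helper: fill an NA only when its nearest non-NA neighbours agree.
fill_equal_neighbours <- function(x) {
  fwd <- locf(x)            # nearest non-NA value to the left
  bwd <- rev(locf(rev(x)))  # nearest non-NA value to the right
  same <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[same] <- fwd[same]
  x
}

a <- c(NA, 1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3, NA)
filled <- fill_equal_neighbours(a)
filled
# [1] NA 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3 NA
```

Leading and trailing NAs have no neighbour on one side, so one of the two fills is NA there and they are left untouched.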
You can write a function like this:
fill_data <- function(vec) {
for(l in unique(vec[!is.na(vec)])) {
g <- which(vec %in% l)
indexes <- list()
# seq_len() gives an empty loop when a value occurs only once (1:0 would not)
for(i in seq_len(length(g) - 1)) {
indexes[[i]] <- (g[i]+1):(g[i+1]-1)
}
for(i in seq_len(length(g) - 1)) {
if(all(is.na(vec[indexes[[i]]]))) {
vec[indexes[[i]]] <- l
}
}
}
return(vec)
}
Running the function:
a = c(1, NA, NA, NA, 1, NA, NA, NA, 2, NA, NA, 2, 3, NA, NA, 3)
fill_data(a)
[1] 1 1 1 1 1 NA NA NA 2 2 2 2 3 3 3 3
It also works when the same value recurs in different places of the vector:
ab = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA , 1, NA, 1, 3, NA, NA, 3)
fill_data(ab)
[1] 1 1 1 1 1 1 1 1 1 NA 2 2 2 2 2 NA 1 1 1 3 3 3 3
Explanation:
First, the function finds the unique non-NA values.
Then, for each unique value, it takes the indexes where that value occurs and collects the positions between consecutive occurrences.
Finally, it tests whether the values at those positions are all NA and, if they are, replaces them with that value.
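As a rough illustration of these steps, here is a trace for the single value 1 in the vector ab from above. The variable gaps and the use of Map/vapply are my own shorthand for the two inner loops, not part of the function itself.

```r
ab <- c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, 2, NA, NA, NA, 2, NA, 1, NA, 1, 3, NA, NA, 3)
l <- 1
# Positions where the value l occurs
g <- which(ab %in% l)
g
# [1]  1  5  9 17 19
# Positions strictly between each pair of consecutive occurrences of l
gaps <- Map(function(from, to) (from + 1):(to - 1), head(g, -1), tail(g, -1))
# Which of those gaps consist solely of NA (only these get filled with l)
all_na <- vapply(gaps, function(idx) all(is.na(ab[idx])), logical(1))
all_na
# [1]  TRUE  TRUE FALSE  TRUE
```

The third gap (positions 10 to 16) contains the 2s, so it is skipped for the value 1 and handled, in part, when the loop reaches the value 2.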