I wish to remove values in a time series which are surrounded by blocks of NA
of a certain minimal length.
Some toy data:
x = seq(0,10,length.out = 100)
y = sin(x) + rnorm(length(x), mean=0, sd=0.1)
y[20:21] = rep(NA, 2)
y[50:54] = rep(NA, 5)
y[55:59] = seq(-0.1, -0.8, length.out = 5)
y[60:64] = rep(NA, 5)
y[90:91] = rep(NA, 2)
df <- data.frame(x, y)
I wish to remove any sequence of y values which is less than 10 in length and which is preceded and followed by 5 or more NA values.
In my toy data, the y values at indices 55-59 (a) form a run of fewer than 10 consecutive values and (b) have 5 NA on both sides. Thus, this block of values should be removed.
The other values consist of longer blocks and/or are surrounded by short runs of NA (< 5) and should be kept.
Plot with the values to be removed in red color:
library(ggplot2)
ggplot(data = df, aes(x, y)) +
  geom_line() +
  geom_line(data = df[55:59, ], color = "red")
First, we will define the two thresholds you specified. (I set the second one to 4 so we can work consistently with "<" and ">", instead of the error-prone "<" and ">=").
threshold.data <- 10
threshold.NA <- 4
Now, the key is to work with run length encoding on is.na(y). Look at ?rle.
foo <- rle(is.na(y))
foo
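To make the structure of an rle() result concrete, here is a minimal sketch on a tiny made-up vector (the names v and r are only for illustration and are not part of the toy data above). It shows that values holds one entry per run and lengths the size of that run:

```r
# Tiny made-up vector, not the toy data from the question
v <- c(1, NA, NA, 3, 4, NA)
r <- rle(is.na(v))

r$values   # FALSE  TRUE FALSE  TRUE   (one entry per run)
r$lengths  # 1 2 2 1                   (how long each run is)

# inverse.rle() rebuilds the original logical vector from the encoding
identical(inverse.rle(r), is.na(v))  # TRUE
```

So a "run of NAs" is simply an entry where values is TRUE, and its size is the matching entry of lengths.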
First, we extract possible "candidate runs of NAs" by checking where the original data are NA (thus foo$values will be TRUE) and we have the specified minimum run length of NAs:
candidate.runs.NA <- which(foo$values & foo$lengths>threshold.NA)
We only want to proceed if we have at least two NA runs over the threshold:
if ( diff(range(candidate.runs.NA)) >= 2 ) {
Our goal is to find the indices of the non-NA data that we want to remove. For this, we find "candidate runs of (non-NA) data". In a first step, that includes all runs between the first and the last NA run identified above:
candidate.runs.data <- seq(candidate.runs.NA[1]+1,tail(candidate.runs.NA,1)-1)
We refine this by two criteria. On the one hand, we only want sequences of non-NAs, and on the other hand, these sequences should be below the threshold in length:
candidate.runs.data <- candidate.runs.data[!foo$values[candidate.runs.data] &
foo$lengths[candidate.runs.data]<threshold.data]
In your example, candidate.runs.data will now have only one entry, 5. This means that we need to remove all data in the 5th run of our is.na sequence. For this, we need to restore the actual indices:
indices.to.remove <- unlist(lapply(candidate.runs.data,function(kk)
seq(sum(foo$lengths[1:(kk-1)])+1,sum(foo$lengths[1:kk]))))
This is a bit complicated, since I wrapped it in an lapply() call, in case we get multiple candidate.runs.data to remove. unlist() then flattens the per-run index vectors, which may differ in length, into one plain vector. Finally, we remove these data:
y[indices.to.remove] <- NA
}
plot(x,y,"l")
Now, this seems to do what you want for your specific example. You may want to think about what you want to happen in boundary cases. For instance, this assumes that your series starts with a non-NA. And what should happen if you don't have two runs of five or more NAs, but three, or five? With or without shorter NA runs between the "long" runs? This script will consider any run of up to nine non-NAs between the first and the last "long" run as fair game.
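One possible way to settle those boundary cases is to require that both immediate neighbours of a short data run are long NA runs. The following is only a sketch under that assumption; the function name remove_isolated_runs and its arguments max.data and min.na are hypothetical, and whether this is the behaviour you want is for you to decide:

```r
# Hypothetical helper; name and arguments are illustrative only.
# Sets to NA every run of non-NA values shorter than `max.data` whose
# two immediate neighbours are both NA runs of at least `min.na` values.
remove_isolated_runs <- function(y, max.data = 10, min.na = 5) {
  r <- rle(is.na(y))
  n <- length(r$lengths)
  if (n < 3) return(y)              # no run can have two neighbours
  ends   <- cumsum(r$lengths)       # last index of each run
  starts <- ends - r$lengths + 1    # first index of each run
  long.na <- r$values & r$lengths >= min.na
  for (k in 2:(n - 1)) {
    if (!r$values[k] && r$lengths[k] < max.data &&
        long.na[k - 1] && long.na[k + 1]) {
      y[starts[k]:ends[k]] <- NA
    }
  }
  y
}

# Made-up series with three long NA runs and one short NA run
y2 <- c(1, rep(NA, 5), 2, 3, rep(NA, 5), 4, NA, NA, 5, rep(NA, 5), 6)
which(!is.na(remove_isolated_runs(y2)))  # 1 14 17 23
```

Here only positions 7 and 8 are blanked. Positions 14 and 17 each touch the short two-NA run, so they are kept, whereas a rule that treats everything between the first and the last long run as a candidate would blank them as well.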
You can treat your time series as a character string and take advantage of regular expressions here. The problem is easy to solve with the function str_locate_all from the stringr package.
st <- paste0(as.integer(is.na(df$y)), collapse = '')
# [1] "0000000000000000000110000000000000000000000000000111110000011111000000000000000000000000011000000000"
library(stringr)
str_locate_all(st, "1{5,}0{1,9}1{5,}")
# pattern: at least 5 ones, then 1 to 9 zeros (fewer than 10, as the question asks), then at least 5 ones again
# output will be:
# [[1]]
# start end
# [1,] 50 64
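The located span still includes the NA runs themselves, so one more step is needed to blank the data in between. As a sketch, here is a base R variant that swaps stringr for gregexpr() with perl = TRUE and uses lookarounds so that only the zeros are matched. This is an assumption about how one might finish the job, not part of the answer above, and the names y3, st3, m and idx are made up for the example:

```r
# Made-up series: 1 value, 5 NA, 3 values, 5 NA, 2 values
y3 <- c(1, rep(NA, 5), 2, 3, 4, rep(NA, 5), 5, 6)
st3 <- paste0(as.integer(is.na(y3)), collapse = "")

# 1 to 9 zeros that have 5 ones directly before and after them.
# The lookarounds do not consume the ones, so one long NA run can
# serve as the neighbour of two different data runs.
m <- gregexpr("(?<=1{5})0{1,9}(?=1{5})", st3, perl = TRUE)[[1]]

if (m[1] != -1) {                       # -1 means "no match"
  idx <- unlist(Map(function(s, len) seq(s, length.out = len),
                    as.integer(m), attr(m, "match.length")))
  y3[idx] <- NA
}
which(!is.na(y3))  # 1 15 16
```

Since each character of the string corresponds to one element of the series, the match positions can be used directly as indices into y3.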