
Remove values which are surrounded by a certain number of NAs

I wish to remove values in a time series which are surrounded by blocks of NA of a certain minimal length.

Some toy data:

x = seq(0,10,length.out = 100)
y = sin(x) + rnorm(length(x), mean=0, sd=0.1)
y[20:21] = rep(NA, 2)
y[50:54] = rep(NA, 5)
y[55:59] = seq(-0.1, -0.8, length.out = 5)
y[60:64] = rep(NA, 5)
y[90:91] = rep(NA, 2)

df <- data.frame(x, y)

I wish to remove any sequence of y values which is fewer than 10 values long and which is preceded and followed by 5 or more NA values.

In my toy data, the y values at indices 55-59 (a) form a run of fewer than 10 consecutive values and (b) have 5 NA on both sides. Thus, this block of values should be removed.

The other values consist of longer blocks and/or are surrounded by short runs of NA (< 5) and should be kept.

Plot with the values to be removed in red:

library(ggplot2)
ggplot(data = df, aes(x, y)) +
  geom_line() +
  geom_line(data = df[55:59, ], color = "red")


mrdevlar asked Apr 24 '15 10:04


2 Answers

First, we will define the two thresholds you specified. (I set the second one to 4 so we can work consistently with "<" and ">", instead of the error-prone "<" and ">=").

threshold.data <- 10
threshold.NA <- 4

Now, the key is to work with run length encoding on is.na(y). Look at ?rle.

foo <- rle(is.na(y))
foo
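
To make the run-length encoding concrete, here is a minimal sketch on a small hypothetical vector `v` (not part of the question's data). The names `run.end` and `run.start` are introduced only for illustration; they show how `cumsum()` on the run lengths recovers the positions of each run in the original vector.

```r
# Hypothetical toy vector: one value, two NAs, two values, one NA
v <- c(1, NA, NA, 3, 4, NA)

r <- rle(is.na(v))
r$lengths   # 1 2 2 1
r$values    # FALSE TRUE FALSE TRUE

# End and start index of every run in the original vector
run.end   <- cumsum(r$lengths)          # 1 3 5 6
run.start <- run.end - r$lengths + 1    # 1 2 4 6
```

Applied to the question's `y`, the same call yields nine runs, and the block at indices 55-59 is the fifth of them.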

First, we extract possible "candidate runs of NAs" by checking where the original data are NA (thus foo$values will be TRUE) and we have the specified minimum run length of NAs:

candidate.runs.NA <- which(foo$values & foo$lengths>threshold.NA)

We only want to proceed if we have at least two NA runs over the threshold:

if ( length(candidate.runs.NA) >= 2 ) {

Our goal is to find the indices of the non-NA data that we want to remove. For this, we find "candidate runs of (non-NA) data". In a first step, that includes all runs between the first and the last NA run identified above:

    candidate.runs.data <- seq(candidate.runs.NA[1]+1,tail(candidate.runs.NA,1)-1)

We refine this by two criteria. On the one hand, we only want sequences of non-NAs, and on the other hand, these sequences should be below the threshold in length:

    candidate.runs.data <- candidate.runs.data[!foo$values[candidate.runs.data] &
      foo$lengths[candidate.runs.data]<threshold.data]

In your example, candidate.runs.data will now have only one entry 5. This means that we need to remove all data in the 5th run of our is.na sequence. For this, we need to restore the actual indices:

    indices.to.remove <- unlist(lapply(candidate.runs.data,function(kk)
      seq(sum(foo$lengths[1:(kk-1)])+1,sum(foo$lengths[1:kk]))))

This is a bit complicated, since I wrapped it in an lapply() call, in case we get multiple candidate.runs.data to remove; unlist() then flattens the result into a single index vector, even when the runs differ in length. Finally, we remove these data:

    y[indices.to.remove] <- NA
}
plot(x,y,"l")


Now, this seems to do what you want for your specific example. You may want to think about what you want to happen in boundary cases. For instance, this assumes that your series starts with a non-NA. And what should happen if you don't have two runs of five or more NAs, but three, or five? With or without shorter NA runs between the "long" runs? This script will consider any run of up to nine non-NAs between the first and the last "long" run as fair game.
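
One possible way to settle these boundary cases is sketched below. It takes a stricter reading of the question than the script above: a short run of data is removed only if the runs immediately before and after it are both long NA runs. The function name `remove_isolated_runs` and its arguments `max.data` and `min.NA` are hypothetical, and `set.seed()` is added only to make the toy data reproducible.

```r
# Sketch under the assumption that "surrounded" means *directly adjacent*
# long NA runs on both sides. Names and defaults are illustrative only.
remove_isolated_runs <- function(y, max.data = 9, min.NA = 5) {
  r <- rle(is.na(y))
  n <- length(r$lengths)
  if (n < 3) return(y)                      # no interior run can exist

  run.end   <- cumsum(r$lengths)
  run.start <- run.end - r$lengths + 1

  long.NA    <- r$values  & r$lengths >= min.NA
  short.data <- !r$values & r$lengths <= max.data

  # Only interior runs have both a left and a right neighbour
  k   <- 2:(n - 1)
  hit <- k[short.data[k] & long.NA[k - 1] & long.NA[k + 1]]

  # Expand the selected runs back into indices of y
  idx <- unlist(Map(seq, run.start[hit], run.end[hit]))
  y[idx] <- NA
  y
}

# Usage on the toy data from the question
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(length(x), mean = 0, sd = 0.1)
y[20:21] <- NA
y[50:54] <- NA
y[55:59] <- seq(-0.1, -0.8, length.out = 5)
y[60:64] <- NA
y[90:91] <- NA

cleaned <- remove_isolated_runs(y)
```

Because every run is checked against its own neighbours, this version does not depend on whether the series starts or ends with NA, and it handles any number of long NA runs. A short block bordered by a long NA run on one side and a short one on the other is kept.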

Stephan Kolassa answered Sep 18 '22 13:09


You can treat your time series as a character string and take advantage of regular expressions. The function str_locate_all from the stringr package makes this easy.

st <- paste0(as.integer(is.na(df$y)), collapse = '')
# [1] "0000000000000000000110000000000000000000000000000111110000011111000000000000000000000000011000000000"
library(stringr)
str_locate_all(st, "1{5,}0{1,9}1{5,}")
# pattern: at least 5 ones, then 1 to 9 zeros (fewer than 10), then at least 5 ones

# output will be:
# [[1]]
#      start end
# [1,]    50  64
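
The match above spans the two NA runs as well as the values between them, so the values still have to be set to NA. A minimal sketch of that last step follows, using base R's `gregexpr()` with `perl = TRUE` instead of stringr so that lookarounds can isolate just the run of zeros. The function name `remove_short_runs_regex` and its arguments are hypothetical, and `set.seed()` is added only for reproducibility.

```r
# Sketch: locate short runs of zeros (data) flanked by long runs of ones (NA).
# The lookbehind/lookahead do not consume the ones, so one NA run can
# serve as the boundary of two consecutive data blocks.
remove_short_runs_regex <- function(y, min_na = 5, max_data = 9) {
  st <- paste0(as.integer(is.na(y)), collapse = "")
  pattern <- sprintf("(?<=1{%d})0{1,%d}(?=1{%d})", min_na, max_data, min_na)

  m <- gregexpr(pattern, st, perl = TRUE)[[1]]
  if (m[1] == -1) return(y)                 # no match: nothing to remove

  starts <- as.integer(m)
  ends   <- starts + attr(m, "match.length") - 1
  idx    <- unlist(Map(seq, starts, ends))
  y[idx] <- NA
  y
}

# Usage on the toy data from the question
set.seed(1)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(length(x), mean = 0, sd = 0.1)
y[20:21] <- NA
y[50:54] <- NA
y[55:59] <- seq(-0.1, -0.8, length.out = 5)
y[60:64] <- NA
y[90:91] <- NA

cleaned <- remove_short_runs_regex(y)
which(is.na(cleaned) & !is.na(y))   # 55 56 57 58 59
```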

inscaven answered Sep 20 '22 13:09