I want to generate missing values in a vector so that the missing values are grouped in sequences, to simulate periods of missing data of different lengths.
Let's say I have a vector of 10,000 values and I want to generate 12 sequences of NA at random locations in the vector, each sequence having a random length L
between 1 and 144 (144 simulates 1 day of missing values at a 10-minute timestep). The sequences must not overlap.
How can I do that? Thanks.
I tried combining lapply and seq without success.
An example expected output with 3 distinct sequences:
# 1 2 3 5 2 NA NA 5 4 6 8 9 10 11 NA NA NA NA NA NA 5 2 NA NA NA...
EDIT
I'm dealing with a seasonal time series so the NA must overwrite values and not be inserted as new elements.
All other answers more or less follow a "conditional specification" where the starting index and run length of the NA chunks are simulated. However, as the non-overlapping condition must be satisfied, these chunks have to be determined one by one. Such dependence prohibits vectorization, and either a for loop or lapply / sapply must be used.
However, this problem is just another run length problem. 12 non-overlapping NA chunks would divide the whole sequence into 13 non-missing chunks (yep, I guess this is what the OP wants, as missing chunks occurring as the first or the last chunk are not interesting). So why not think of the following:
1. generate the run lengths of the 12 missing chunks;
2. generate the run lengths of the 13 non-missing chunks;
3. interleave these two types of chunks.
The second step looks difficult, as the lengths of all chunks must sum to a fixed number. Well, the multinomial distribution is made for just this.
So here is a fully vectorized solution:
# run length of 12 missing chunks, with feasible length between 1 and 144
k <- sample.int(144, 12, TRUE)
# run length of 13 non-missing chunks, summing up to `10000 - sum(k)`
# equal probability is used as an example, you may try something else
m <- c(rmultinom(1, 10000 - sum(k), prob = rep.int(1, 13)))
# interleave `m` and `k`
n <- c(rbind(m[1:12], k), m[13])
# reference value: 1 for non-missing and NA for missing, and interleave them
ref <- c(rep.int(c(1, NA), 12), 1)
# an initial vector
vec <- rep.int(ref, n)
# missing index
miss <- is.na(vec)
We can verify that sum(n) is 10000. What's next? Feel free to fill in the non-missing entries with random integers, maybe?
My initial answer may have been too short to follow, hence the expansion above.
It is straightforward to write a function implementing the above, with user input in place of the example parameter values 12, 144 and 10000.
Note that the only potential problem with the multinomial is that, under some bad prob, it could generate some zeros, in which case some NA chunks would in fact join together. To get around this, a robust check is: replace every 0 in m with 1, and subtract the total inflation caused by this change from max(m).
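As a minimal sketch of such a function, including the zero check, something like the following might work. The name na_chunks and its arguments len, n_chunks and max_len are made up for illustration and are not part of the original answer. It assumes len is much larger than n_chunks * max_len, so that the largest non-missing chunk can absorb the adjustment.

```r
# Hypothetical helper (illustrative name and arguments): returns a logical
# vector of length `len` that is TRUE inside `n_chunks` separate NA chunks.
na_chunks <- function(len = 10000L, n_chunks = 12L, max_len = 144L) {
  # run lengths of the missing chunks, each between 1 and max_len
  k <- sample.int(max_len, n_chunks, replace = TRUE)
  # run lengths of the n_chunks + 1 non-missing chunks, summing to len - sum(k)
  m <- c(rmultinom(1, len - sum(k), prob = rep.int(1, n_chunks + 1L)))
  # zero check: raise zero-length chunks to 1 and take the excess from the largest chunk
  zero <- m == 0L
  if (any(zero)) {
    m[zero] <- 1L
    j <- which.max(m)
    stopifnot(m[j] > sum(zero))  # assumption: the largest chunk can absorb the change
    m[j] <- m[j] - sum(zero)
  }
  # interleave non-missing and missing run lengths, then expand
  n <- c(rbind(m[seq_len(n_chunks)], k), m[n_chunks + 1L])
  ref <- c(rep.int(c(1, NA), n_chunks), 1)
  is.na(rep.int(ref, n))
}

# usage: overwrite values of an existing series instead of inserting new elements
x <- rnorm(10000)
x[na_chunks(length(x))] <- NA
```

Because the result is only an index, it overwrites existing values and leaves the length of the series unchanged, which is what the question's EDIT asks for.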
If both the starting position and the run length of each NA sequence are supposed to be random, I think you cannot be sure to find a fitting solution immediately, since your constraint is that the sequences must not overlap.
Therefore I propose the following solution, which tries up to a limited number of times (max_iter) to find a fitting combination of starting positions and NA run lengths. If one is found, it is returned; if none is found within the defined maximum number of iterations, you'll just get a notice instead.
x <- 1:1000   # vector that receives the NAs
n <- 3        # number of NA runs
m <- 1:144    # candidate run lengths
f <- function(x, n, m, max_iter = 100) {
  i <- 0
  repeat {
    i <- i + 1
    idx <- sort(sample(seq_along(x), n))         # starting positions
    dist <- diff(c(idx, length(x)))              # distance to the next start (or to the end)
    na_len <- sample(m, n, replace = TRUE) - 1L  # lengths of the NA runs, minus 1
    ok <- all(na_len < dist)                     # no run reaches the next start or passes the end
    if (ok || i == max_iter) break
  }
  if (ok) {
    replace(x, unlist(Map(":", idx, idx + na_len)), NA)
  } else {
    cat("no solution found in", max_iter, "iterations\n")
  }
}
f(x, n, m, max_iter = 20)
Of course you can easily increase the number of iterations, and you should note that with larger n it becomes increasingly difficult (more iterations are required) to find a solution.
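One way to check a result is to count the NA runs with rle(). This matters here because the overlap test allows one run to end directly before the next one starts, in which case two runs merge into a single longer one. The helper name na_runs below is hypothetical and not part of the answer; it is only a small sketch for inspecting output.

```r
# Hypothetical helper: lengths of the consecutive NA runs in a vector.
na_runs <- function(x) {
  r <- rle(is.na(x))
  r$lengths[r$values]
}

y <- c(1, 2, NA, NA, 5, NA, 7, NA, NA, NA)
na_runs(y)
# [1] 2 1 3
```

If length(na_runs(result)) is smaller than n, at least two of the sampled runs ended up adjacent.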