
generate random sequences of NA of random lengths in a vector

I want to generate missing values in a vector so that the missing values are grouped in sequences, to simulate periods of missing data of different lengths.

Let's say I have a vector of 10 000 values and I want to generate 12 sequences of NA at random locations in the vector, each sequence having a random length L between 1 and 144 (144 simulates 2 days of missing values at a 10-minute timestep). The sequences must not overlap.

How can I do that? Thanks.

I tried combining lapply and seq without success.

An example expected output with 3 distinct sequences:

# 1 2 3 5 2 NA NA 5 4 6 8 9 10 11 NA NA NA NA NA NA 5 2 NA NA NA...

EDIT

I'm dealing with a seasonal time series so the NA must overwrite values and not be inserted as new elements.

agenis asked Jun 16 '17 13:06


2 Answers

All other answers more or less follow a "conditional specification", in which the starting index and run length of each NA chunk are simulated. However, because the non-overlapping condition must be satisfied, these chunks have to be determined one by one. This dependence prevents vectorization, so either a for loop or lapply / sapply must be used.

However, this problem is just another run-length problem. 12 non-overlapping NA chunks divide the whole sequence into 13 non-missing chunks (I assume this is what the OP wants, since a missing chunk occurring as the first or last chunk is not interesting). So why not think of the following:

  • generate the run lengths of the 12 missing chunks;
  • generate the run lengths of the 13 non-missing chunks;
  • interleave these two types of chunks.

The second step looks difficult, as the lengths of all chunks must sum to a fixed number. Well, the multinomial distribution is made for just this.

So here is a fully vectorized solution:

# run length of 12 missing chunks, with feasible length between 1 and 144
k <- sample.int(144, 12, TRUE)

# run length of 13 non-missing chunks, summing up to `10000 - sum(k)`
# equal probability is used as an example, you may try something else
m <- c(rmultinom(1, 10000 - sum(k), prob = rep.int(1, 13)))

# interleave `m` and `k`
n <- c(rbind(m[1:12], k), m[13])

# reference value: 1 for non-missing and NA for missing, and interleave them
ref <- c(rep.int(c(1, NA), 12), 1)

# an initial vector
vec <- rep.int(ref, n)

# missing index
miss <- is.na(vec)

We can verify that sum(n) is 10000. What's next? Feel free to fill in non-missing entries with random integers maybe?
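
Since the question asks for the NAs to overwrite existing values, the logical index `miss` can also be applied directly to a vector of the same length. The following is only a minimal sketch: `x` is a hypothetical stand-in for the real series, `set.seed` is there purely for reproducibility, and `rle()` is used to recover the NA runs as a check.

```r
set.seed(1)  # only for reproducibility of this sketch

# same construction as above
k <- sample.int(144, 12, TRUE)
m <- c(rmultinom(1, 10000 - sum(k), prob = rep.int(1, 13)))
n <- c(rbind(m[1:12], k), m[13])
miss <- is.na(rep.int(c(rep.int(c(1, NA), 12), 1), n))

# `x` is a hypothetical stand-in for the real time series
x <- rnorm(10000)
x[miss] <- NA                    # overwrite in place; the length is unchanged

# recover the NA runs to check the result
r <- rle(is.na(x))
na_runs <- r$lengths[r$values]   # lengths of the NA runs, in order
stopifnot(length(x) == 10000, sum(is.na(x)) == sum(k))
```

If no entry of `m` is zero, `na_runs` should contain 12 values equal to `k`.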


My initial answer may have been too short to follow, hence the expanded explanation above.

It is straightforward to write a function implementing the above, with user input, in place of example parameter values 12, 144, 10000.
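
One possible shape for such a function is sketched below. The name `make_na_mask` and its arguments (`len`, `n_chunks`, `max_run`, `prob`) are hypothetical, chosen for illustration only. It returns a logical mask rather than a new vector, so that the NAs can overwrite values in an existing series, and it does not guard against zero-length non-missing chunks.

```r
# `make_na_mask` and its argument names are hypothetical, chosen for this sketch
make_na_mask <- function(len, n_chunks, max_run,
                         prob = rep.int(1, n_chunks + 1)) {
  # run lengths of the missing chunks, each between 1 and max_run
  k <- sample.int(max_run, n_chunks, replace = TRUE)
  stopifnot(sum(k) <= len)
  # run lengths of the non-missing chunks, summing to len - sum(k)
  m <- c(rmultinom(1, len - sum(k), prob = prob))
  # interleave: non-missing, missing, ..., non-missing
  n <- c(rbind(m[seq_len(n_chunks)], k), m[n_chunks + 1])
  ref <- c(rep.int(c(FALSE, TRUE), n_chunks), FALSE)
  rep.int(ref, n)   # TRUE marks positions to be set to NA
}

x <- sin(seq_len(10000) / 50)   # hypothetical example series
x[make_na_mask(length(x), 12, 144)] <- NA
```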

Note that the only potential problem with the multinomial is that, under some bad prob, it could generate some zeros, in which case some NA chunks will in fact join together. To get around this, a robust fix is to replace all 0s with 1s and subtract the resulting inflation from max(m).
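
That correction could look roughly like the sketch below. The helper name `fix_zeros` is hypothetical, and the sketch assumes the largest chunk is big enough to absorb the inflation.

```r
# `fix_zeros` is a hypothetical helper name used for this sketch
fix_zeros <- function(m) {
  zero <- m == 0L
  inflation <- sum(zero)   # how much the total grows after the replacement
  m[zero] <- 1L
  i <- which.max(m)
  # assumption: the largest chunk can absorb the inflation
  stopifnot(m[i] - inflation >= 1L)
  m[i] <- m[i] - inflation
  m
}

m <- c(0L, 500L, 0L, 20L)
fix_zeros(m)   # 1 498 1 20: same total, no zeros
```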

Zheyuan Li answered Oct 26 '22 10:10


If both the starting position and the run length of each NA sequence are supposed to be random, I think you cannot be sure to find a fitting solution immediately, since your constraint is that the sequences must not overlap.

Therefore I propose the following solution, which tries up to a limited number of times (max_iter) to find a fitting combination of starting positions and NA run lengths. If one is found, it is returned; if none is found within the defined maximum number of iterations, you'll just get a notice.

x = 1:1000
n = 3
m = 1:144

f <- function(x, n, m, max_iter = 100) {
  i <- 0
  repeat {
    i <- i + 1
    idx <- sort(sample(seq_along(x), n))        # starting positions
    dist <- diff(c(idx, length(x)))             # distance to the next start (or to the end)
    na_len <- sample(m, n, replace = TRUE) - 1L # lengths of NA runs, minus 1
    ok <- all(na_len < dist)                    # check overlap
    if (ok || i == max_iter) break              # scalar test, so use ||
  }

  if (ok) {
    # each run covers idx, idx + 1, ..., idx + na_len
    replace(x, unlist(Map(":", idx, idx + na_len)), NA)
  } else {
    cat("no solution found in", max_iter, "iterations\n")
  }
}

f(x, n, m, max_iter = 20)

Of course, you can easily increase the number of iterations. Note that with larger n it becomes increasingly difficult (more iterations are required) to find a solution.
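
To check a result, one option is to recover the NA runs with `rle()`. The helper name `na_runs` below is hypothetical, and the example vector is made up for illustration. Because the overlap test above is `na_len < dist`, two runs may end up directly adjacent, in which case this check would count them as one longer run.

```r
# `na_runs` is a hypothetical helper name used for this sketch
na_runs <- function(v) {
  r <- rle(is.na(v))
  r$lengths[r$values]   # lengths of the consecutive NA runs, in order
}

v <- c(1, 2, NA, NA, 5, 4, NA, NA, NA, 9)
na_runs(v)           # 2 3
length(na_runs(v))   # 2
```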

talat answered Oct 26 '22 09:10