Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding subvector of maximum length containing a small proportion of 0's

Tags:

r

count

vector

I have a vector that contains a sequence of 1 and 0. Suppose of it is of length 166 and it is

  y <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1, 1,1,1,1,1,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,0,1,1,0,1,1,1,0,0,0,0,0,1,1,1,1)

Now I want to to extract a LONGEST POSSIBLE sub vector from above vector such that it satisfies two properties

(1) sub-vector should start from 1 and end with 1.

(2) It can contain up to 5% zeros of total length of sub-vector.

I started with rle function. It counts the 1 and 0 at each step. So it will be like

z <- rle(y)
d <- data.frame(z$values, z$lengths)
colnames(d) <- c("value", "length")

It gives me

> d
   value length
1      1     22
2      0      1
3      1     13
4      0      1
5      1      2
6      0      1
7      1      1
8      0      1
9      1      1
10     0      5
11     1      1
12     0      3
13     1      2
14     0      1
15     1      1
16     0      1
17     1     74
18     0      2
19     1     17
20     0      1
21     1      2
22     0      1
23     1      3
24     0      5
25     1      4

In this case 74 + 2+ 17 + 1 + 2 + 3 = 99 is the required sub-sequence as it contains 2+1+1=4 zeros which is less than 5% of 99. If we move forward and sequence will become 99+5+4 =108 and zeros will be 4+5=9 which will be more than 5% of 108.

like image 923
Pankaj Avatar asked Mar 08 '16 18:03

Pankaj


1 Answers

I think you are very close by computing the run-length encoding of this vector. All that remains is to consider all pairs of runs of 1's and to select the pair that is of the longest length and matches the "no more than 5% zeros" rule. This can be done in a fully vectorized manner using combn to compute all pairs of runs of 1's and cumsum to get lengths of runs from the rle output:

ones <- which(d$value == 1)
# pairs holds pairs of rows in d that correspond to runs of 1's
if (length(ones) >= 2) {
  pairs <- rbind(t(combn(ones, 2)), cbind(ones, ones))
} else if (length(ones) == 1) {
  pairs <- cbind(ones, ones)
}

# Taking cumulative sums of the run lengths enables vectorized computation of the lengths
#   of each run in the "pairs" matrix
cs <- cumsum(d$length)
pair.length <- cs[pairs[,2]] - cs[pairs[,1]] + d$length[pairs[,1]]
cs0 <- cumsum(d$length * (d$value == 0))
pair.num0 <- cs0[pairs[,2]] - cs0[pairs[,1]]

# Multiple the length of a pair by an indicator for whether it's valid and take the max
selected <- which.max(pair.length * ((pair.num0 / pair.length) <= 0.05))
d[pairs[selected,1]:pairs[selected,2],]
#    value length
# 15     1      1
# 16     0      1
# 17     1     74
# 18     0      2
# 19     1     17
# 20     0      1
# 21     1      2
# 22     0      1
# 23     1      3

We actually found a subvector that is slightly longer that the one found by the OP: it has 102 elements and five 0's (4.90%).

like image 125
josliber Avatar answered Oct 23 '22 10:10

josliber