Background: I'm trying to strip out a corpus where the speaker is identified. I've reduced the problem of removing a particular speaker from the corpuse to the following stream of 1,0, and NA (x). 0 means that person is speaking, 1 someone else is speaking, NA means that whoever was the last speaker is still speaking. Here's a visual example: <pre class="prettyprint"><code>0 1 S0: Hello, how are you today? 1 2 S1: I'm great thanks for asking! NA 3 I'm a little tired though! 0 4 S0: I'm sorry to hear that. Are you ready for our discussion? 1 5 S1: Yes, I have everything I need. NA 7 Let's begin. </code></pre> So from this frame, I'd like to take 2,3,5, and 7. Or,. I would like the result to be 0,1,1,0,1,1. How do I pull the positions of each run of 1 and NA up to the position before the next 0 in a vector. Here is an example, and my desired output: Example input: <pre class="prettyprint"><code>x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0) </code></pre> Example output: These are the positions that I want because they identify that "speaker 1" is talking (1, or 1 followed by NA up to the next 0) <pre class="prettyprint"><code>pos <- c(6,8,9,10,11,15,16,17) </code></pre> An alternative output would be a filling: <pre class="prettyprint"><code>fill <- c(0,0,0,0,0,1,0,1,1,1,1,0,0,0,1,1,1,0) </code></pre> Where the NA values of the previous 1 or 0 are filled up to the next new value.

<pre class="prettyprint"><code>s <- which(x==1); e <- c(which(x!=1),length(x)+1L); unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L)); ## [1] 6 8 9 10 11 15 16 17 </code></pre> Every occurrence of a 1 in the input vector is the start of a sequence of position indexes applicable to speaker 1. We capture this in <code>s</code> with <code>which(x==1)</code>. For each start index, we must find the length of its containing sequence. The length is determined by the closest forward occurrence of a 0 (or, more generally, any non-NA value other than 1, if such was possible). Hence, we must first compute <code>which(x!=1)</code> to get these indexes. Because the final occurrence of a 1 may not have a forward occurrence of a 0, we must append an extra virtual index one unit past the end of the input vector, which is why we must call <code>c()</code> to combine <code>length(x)+1L</code>. We store this as <code>e</code>, reflecting that these are (potential) end indexes. Note that these are exclusive end indexes; they are not actually part of the (potential) preceding speaker 1 sequence. Finally, we must generate the actual sequences. To do this, we must make one call to <code>seq()</code> for each element of <code>s</code>, also passing its corresponding end index from <code>e</code>. To find the end index we can use <code>findInterval()</code> to find the index into <code>e</code> whose element value (that is, the end index into <code>x</code>) falls just before each respective element of <code>s</code>. (The reason why it is just before is that the algorithm used by <code>findInterval()</code> is <code>v[i[j]] ≤ x[j] < v[i[j]+1]</code> as explained on the doc page.) We must then add one to it to get the index into <code>e</code> whose element value falls just after each respective element of <code>s</code>. We then index <code>e</code> with it, giving us the end indexes into <code>x</code> that follow each respective element of <code>s</code>. We must subtract one from that because the sequence we generate must exclude the (exclusive) end element. The easiest way to make the calls to <code>seq()</code> is to <code>Map()</code> the two endpoint vectors to it, returning a list of each sequence, which we can <code>unlist()</code> to get the required output. <hr> <pre class="prettyprint"><code>s <- which(!is.na(x)); rep(c(0,x[s]),diff(c(1L,s,length(x)+1L))); ## [1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0 </code></pre> Every occurrence of a non-NA value in the input vector is the start of a segment which, in the output, must become a repetition of the element value at that start index. We capture these indexes in <code>s</code> with <code>which(!is.na(x));</code>. We must then repeat each start element a sufficient number of times to reach the following segment. Hence we can call <code>rep()</code> on <code>x[s]</code> with a vectorized <code>times</code> argument whose values consist of <code>diff()</code> called on <code>s</code>. To handle the final segment, we must append an index one unit past the end of the input vector, <code>length(x)+1L</code>. Also, to deal with the possible case of NAs leading the input vector, we must prepend a 0 to <code>x[s]</code> and a 1 to the <code>diff()</code> argument, which will repeat 0 a sufficient number of times to cover the leading NAs, if such exist. <hr> <h3>Benchmarking (Position)</h3> <pre class="prettyprint"><code>library(zoo); library(microbenchmark); library(stringi); marat <- function(x) { v <- na.locf(zoo(x)); index(v)[v==1]; }; rawr <- function(x) which(zoo::na.locf(c(0L, x))[-1L] == 1L); jota1 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); unlist(gregexpr("1", stringx)); }; jota2 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); newx <-unlist(strsplit(stringx, "")); which(newx == 1); }; jota3 <- function(x) {x[is.na(x)] <- "N"; stringx <- stri_flatten(x); ones <- stri_locate_all_regex(stringx, "1N*")[[1]]; unlist(lapply(seq_along(ones[, 1]), function(ii) seq.int(ones[ii, "start"], ones[ii, "end"]))); }; bgoldst <- function(x) { s <- which(x==1); e <- c(which(x!=1),length(x)+1L); unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L)); }; </code></pre> <hr> <pre class="prettyprint"><code>## OP's test case x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0); ex <- marat(x); identical(ex,rawr(x)); ## [1] TRUE identical(ex,jota1(x)); ## [1] TRUE identical(ex,jota2(x)); ## [1] TRUE identical(ex,jota3(x)); ## [1] TRUE identical(ex,bgoldst(x)); ## [1] TRUE microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x)); ## Unit: microseconds ## expr min lq mean median uq max neval ## marat(x) 411.830 438.5580 503.24486 453.7400 489.2345 2299.915 100 ## rawr(x) 115.466 143.0510 154.64822 153.5280 163.7920 276.692 100 ## jota1(x) 448.180 469.7770 484.47090 479.6125 491.1595 835.633 100 ## jota2(x) 440.911 464.4315 478.03050 472.1290 484.3170 661.579 100 ## jota3(x) 53.885 65.4315 74.34808 71.2050 76.9785 158.232 100 ## bgoldst(x) 34.212 44.2625 51.54556 48.5395 55.8095 139.843 100 </code></pre> <hr> <pre class="prettyprint"><code>## scale test, high probability of NA set.seed(1L); N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T)); ex <- marat(x); identical(ex,rawr(x)); ## [1] TRUE identical(ex,jota1(x)); ## [1] TRUE identical(ex,jota2(x)); ## [1] TRUE identical(ex,jota3(x)); ## [1] TRUE identical(ex,bgoldst(x)); ## [1] TRUE microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x)); ## Unit: milliseconds ## expr min lq mean median uq max neval ## marat(x) 189.34479 196.70233 226.72926 233.39234 237.45738 293.95154 100 ## rawr(x) 24.46984 27.46084 43.91167 29.92112 68.86464 79.53008 100 ## jota1(x) 154.91450 157.09231 161.73505 158.18326 160.42694 206.04889 100 ## jota2(x) 149.47561 151.68187 155.92497 152.93682 154.79668 201.13302 100 ## jota3(x) 82.30768 83.89149 87.35308 84.99141 86.95028 129.94730 100 ## bgoldst(x) 80.94261 82.94125 87.80780 84.02107 86.10844 130.56440 100 </code></pre> <hr> <pre class="prettyprint"><code>## scale test, low probability of NA set.seed(1L); N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T)); ex <- marat(x); identical(ex,rawr(x)); ## [1] TRUE identical(ex,jota1(x)); ## [1] TRUE identical(ex,jota2(x)); ## [1] TRUE identical(ex,jota3(x)); ## [1] TRUE identical(ex,bgoldst(x)); ## [1] TRUE microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x)); ## Unit: milliseconds ## expr min lq mean median uq max neval ## marat(x) 178.93359 189.56032 216.68963 226.01940 234.06610 294.6927 100 ## rawr(x) 17.75869 20.39367 36.16953 24.44931 60.23612 79.5861 100 ## jota1(x) 100.10614 101.49238 104.11655 102.27712 103.84383 150.9420 100 ## jota2(x) 94.59927 96.04494 98.65276 97.20965 99.26645 137.0036 100 ## jota3(x) 193.15175 202.02810 216.68833 209.56654 227.94255 295.5672 100 ## bgoldst(x) 253.33013 266.34765 292.52171 292.18406 311.20518 387.3093 100 </code></pre> <hr> <h3>Benchmarking (Fill)</h3> <pre class="prettyprint"><code>library(microbenchmark); bgoldst <- function(x) { s <- which(!is.na(x)); rep(c(0,x[s]),diff(c(1L,s,length(x)+1L))); }; user31264 <- function(x) { x[is.na(x)]=2; x.rle=rle(x); val=x.rle$v; if (val[1]==2) val[1]=0; ind = (val==2); val[ind]=val[which(ind)-1]; rep(val,x.rle$l); }; </code></pre> <hr> <pre class="prettyprint"><code>## OP's test case x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0); ex <- bgoldst(x); identical(ex,user31264(x)); ## [1] TRUE microbenchmark(bgoldst(x),user31264(x)); ## Unit: microseconds ## expr min lq mean median uq max neval ## bgoldst(x) 10.264 11.548 14.39548 12.403 13.258 73.557 100 ## user31264(x) 31.646 32.930 35.74805 33.785 35.068 84.676 100 </code></pre> <hr> <pre class="prettyprint"><code>## scale test, high probability of NA set.seed(1L); N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T)); ex <- bgoldst(x); identical(ex,user31264(x)); ## [1] TRUE microbenchmark(bgoldst(x),user31264(x)); ## Unit: milliseconds ## expr min lq mean median uq max neval ## bgoldst(x) 10.94491 11.21860 12.50473 11.53015 12.28945 50.25899 100 ## user31264(x) 17.18649 18.35634 22.50400 18.91848 19.53708 65.02668 100 </code></pre> <hr> <pre class="prettyprint"><code>## scale test, low probability of NA set.seed(1L); N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T)); ex <- bgoldst(x); identical(ex,user31264(x)); ## [1] TRUE microbenchmark(bgoldst(x),user31264(x)); ## Unit: milliseconds ## expr min lq mean median uq max neval ## bgoldst(x) 5.24815 6.351279 7.723068 6.635454 6.923264 45.04077 100 ## user31264(x) 11.79423 13.063710 22.367334 13.986584 14.908603 55.45453 100 </code></pre>

You can make use of <code>na.locf</code> from the <code>zoo</code> package: <pre class="prettyprint"><code>library(zoo) x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0) v <- na.locf(zoo(x)) index(v)[v==1] #[1] 6 8 9 10 11 15 16 17 </code></pre>

Pasting the sequence into a string and using a while loop that checks (with <code>grep</code>) whether there are any <code>NA</code>s preceded by <code>1</code>s and substitutes (with <code>gsub</code>) such cases with a <code>1</code> will do it: <pre class="prettyprint"><code># substitute NA for "N" for later ease of processing and locating 1s by position x[is.na(x)] <- "N" # Collapse vector into a string stringx <- paste(x, collapse = "") while(grepl("(?<=1)N", stringx, perl = TRUE)) { stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE) } </code></pre> Then you can use <code>gregexpr</code> to get the indices of 1s. <pre class="prettyprint"><code>unlist(gregexpr("1", stringx)) #[1] 6 8 9 10 11 15 16 17 </code></pre> Or you can split the string and look through to find the indices of 1s in the resulting vector: <pre class="prettyprint"><code>newx <-unlist(strsplit(stringx, "")) #[1] "N" "N" "N" "N" "0" "1" "0" "1" "1" "1" "1" "0" "N" "N" "1" "1" "1" "0" which(newx == "1") #[1] 6 8 9 10 11 15 16 17 </code></pre> <hr> <hr> Using <code>stri_flatten</code> from the <code>stringi</code> package instead of <code>paste</code> and <code>stri_locate_all_fixed</code> rather than <code>gregexpr</code> or a string splitting route can provide a little bit more performance if it's a larger vector you're processing. If the vector isn't large, no performance gains will result. <pre class="prettyprint"><code>library(stringi) x[is.na(x)] <- "N" stringx <- stri_flatten(x) while(grepl("(?<=1)N", stringx, perl = TRUE)) { stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE) } stri_locate_all_fixed(stringx, "1")[[1]][,"start"] </code></pre> <hr> The following approach is fairly straightforward and performs relatively well (based on bgoldst's excellent benchmarking example) on small and large samples (very well on bgoldst's high probability of NA example) <pre class="prettyprint"><code>x[is.na(x)] <- "N" stringx <- stri_flatten(x) ones <- stri_locate_all_regex(stringx, "1N*")[[1]] #[[1]] # # start end #[1,] 6 6 #[2,] 8 11 #[3,] 15 17 unlist(lapply(seq_along(ones[, 1]), function(ii) seq.int(ones[ii, "start"], ones[ii, "end"]))) #[1] 6 8 9 10 11 15 16 17 </code></pre>

Create groups from vector of 0,1 and NA

Tags:

r

Background:

I'm trying to strip out a corpus where the speaker is identified. I've reduced the problem of removing a particular speaker from the corpuse to the following stream of 1,0, and NA (x). 0 means that person is speaking, 1 someone else is speaking, NA means that whoever was the last speaker is still speaking.

Here's a visual example:

0  1 S0: Hello, how are you today?
1  2 S1: I'm great thanks for asking!
NA 3 I'm a little tired though!
0  4 S0: I'm sorry to hear that. Are you ready for our discussion?
1  5 S1: Yes, I have everything I need.
NA 7 Let's begin.

So from this frame, I'd like to take 2,3,5, and 7. Or,. I would like the result to be 0,1,1,0,1,1.

How do I pull the positions of each run of 1 and NA up to the position before the next 0 in a vector.

Here is an example, and my desired output:

Example input:

x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)

Example output:

These are the positions that I want because they identify that "speaker 1" is talking (1, or 1 followed by NA up to the next 0)

pos <- c(6,8,9,10,11,15,16,17)

An alternative output would be a filling:

fill <- c(0,0,0,0,0,1,0,1,1,1,1,0,0,0,1,1,1,0)

Where the NA values of the previous 1 or 0 are filled up to the next new value.

249

asked May 29 '16 02:05

Brandon Bertelsen

4 Answers

s <- which(x==1);
e <- c(which(x!=1),length(x)+1L);
unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L));
## [1]  6  8  9 10 11 15 16 17

Every occurrence of a 1 in the input vector is the start of a sequence of position indexes applicable to speaker 1. We capture this in s with which(x==1).

For each start index, we must find the length of its containing sequence. The length is determined by the closest forward occurrence of a 0 (or, more generally, any non-NA value other than 1, if such was possible). Hence, we must first compute which(x!=1) to get these indexes. Because the final occurrence of a 1 may not have a forward occurrence of a 0, we must append an extra virtual index one unit past the end of the input vector, which is why we must call c() to combine length(x)+1L. We store this as e, reflecting that these are (potential) end indexes. Note that these are exclusive end indexes; they are not actually part of the (potential) preceding speaker 1 sequence.

Finally, we must generate the actual sequences. To do this, we must make one call to seq() for each element of s, also passing its corresponding end index from e. To find the end index we can use findInterval() to find the index into e whose element value (that is, the end index into x) falls just before each respective element of s. (The reason why it is just before is that the algorithm used by findInterval() is v[i[j]] ≤ x[j] < v[i[j]+1] as explained on the doc page.) We must then add one to it to get the index into e whose element value falls just after each respective element of s. We then index e with it, giving us the end indexes into x that follow each respective element of s. We must subtract one from that because the sequence we generate must exclude the (exclusive) end element. The easiest way to make the calls to seq() is to Map() the two endpoint vectors to it, returning a list of each sequence, which we can unlist() to get the required output.

s <- which(!is.na(x));
rep(c(0,x[s]),diff(c(1L,s,length(x)+1L)));
## [1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0

Every occurrence of a non-NA value in the input vector is the start of a segment which, in the output, must become a repetition of the element value at that start index. We capture these indexes in s with which(!is.na(x));.

We must then repeat each start element a sufficient number of times to reach the following segment. Hence we can call rep() on x[s] with a vectorized times argument whose values consist of diff() called on s. To handle the final segment, we must append an index one unit past the end of the input vector, length(x)+1L. Also, to deal with the possible case of NAs leading the input vector, we must prepend a 0 to x[s] and a 1 to the diff() argument, which will repeat 0 a sufficient number of times to cover the leading NAs, if such exist.

Benchmarking (Position)

library(zoo);
library(microbenchmark);
library(stringi);

marat <- function(x) { v <- na.locf(zoo(x)); index(v)[v==1]; };
rawr <- function(x) which(zoo::na.locf(c(0L, x))[-1L] == 1L);
jota1 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); unlist(gregexpr("1", stringx)); };
jota2 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); newx <-unlist(strsplit(stringx, "")); which(newx == 1); };
jota3 <- function(x) {x[is.na(x)] <- "N"; stringx <- stri_flatten(x); ones <- stri_locate_all_regex(stringx, "1N*")[[1]]; unlist(lapply(seq_along(ones[, 1]), function(ii) seq.int(ones[ii, "start"], ones[ii, "end"]))); };
bgoldst <- function(x) { s <- which(x==1); e <- c(which(x!=1),length(x)+1L); unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L)); };

## OP's test case
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);

ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE

microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: microseconds
##        expr     min       lq      mean   median       uq      max neval
##    marat(x) 411.830 438.5580 503.24486 453.7400 489.2345 2299.915   100
##     rawr(x) 115.466 143.0510 154.64822 153.5280 163.7920  276.692   100
##    jota1(x) 448.180 469.7770 484.47090 479.6125 491.1595  835.633   100
##    jota2(x) 440.911 464.4315 478.03050 472.1290 484.3170  661.579   100
##    jota3(x)  53.885  65.4315  74.34808  71.2050  76.9785  158.232   100
##  bgoldst(x)  34.212  44.2625  51.54556  48.5395  55.8095  139.843   100

## scale test, high probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));

ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE

microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: milliseconds
##        expr       min        lq      mean    median        uq       max neval
##    marat(x) 189.34479 196.70233 226.72926 233.39234 237.45738 293.95154   100
##     rawr(x)  24.46984  27.46084  43.91167  29.92112  68.86464  79.53008   100
##    jota1(x) 154.91450 157.09231 161.73505 158.18326 160.42694 206.04889   100
##    jota2(x) 149.47561 151.68187 155.92497 152.93682 154.79668 201.13302   100
##    jota3(x)  82.30768  83.89149  87.35308  84.99141  86.95028 129.94730   100
##  bgoldst(x)  80.94261  82.94125  87.80780  84.02107  86.10844 130.56440   100

## scale test, low probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));

ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE

microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: milliseconds
##        expr       min        lq      mean    median        uq      max neval
##    marat(x) 178.93359 189.56032 216.68963 226.01940 234.06610 294.6927   100
##     rawr(x)  17.75869  20.39367  36.16953  24.44931  60.23612  79.5861   100
##    jota1(x) 100.10614 101.49238 104.11655 102.27712 103.84383 150.9420   100
##    jota2(x)  94.59927  96.04494  98.65276  97.20965  99.26645 137.0036   100
##    jota3(x) 193.15175 202.02810 216.68833 209.56654 227.94255 295.5672   100
##  bgoldst(x) 253.33013 266.34765 292.52171 292.18406 311.20518 387.3093   100

Benchmarking (Fill)

library(microbenchmark);

bgoldst <- function(x) { s <- which(!is.na(x)); rep(c(0,x[s]),diff(c(1L,s,length(x)+1L))); };
user31264 <- function(x) { x[is.na(x)]=2; x.rle=rle(x); val=x.rle$v; if (val[1]==2) val[1]=0; ind = (val==2); val[ind]=val[which(ind)-1]; rep(val,x.rle$l); };

## OP's test case
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);

ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE

microbenchmark(bgoldst(x),user31264(x));
## Unit: microseconds
##          expr    min     lq     mean median     uq    max neval
##    bgoldst(x) 10.264 11.548 14.39548 12.403 13.258 73.557   100
##  user31264(x) 31.646 32.930 35.74805 33.785 35.068 84.676   100

## scale test, high probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));

ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE

microbenchmark(bgoldst(x),user31264(x));
## Unit: milliseconds
##          expr      min       lq     mean   median       uq      max neval
##    bgoldst(x) 10.94491 11.21860 12.50473 11.53015 12.28945 50.25899   100
##  user31264(x) 17.18649 18.35634 22.50400 18.91848 19.53708 65.02668   100

## scale test, low probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));

ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE

microbenchmark(bgoldst(x),user31264(x));
## Unit: milliseconds
##          expr      min        lq      mean    median        uq      max neval
##    bgoldst(x)  5.24815  6.351279  7.723068  6.635454  6.923264 45.04077   100
##  user31264(x) 11.79423 13.063710 22.367334 13.986584 14.908603 55.45453   100

answered Nov 11 '22 20:11

bgoldst

You can make use of na.locf from the zoo package:

library(zoo)
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)

v <- na.locf(zoo(x))
index(v)[v==1]
#[1]  6  8  9 10 11 15 16 17

answered Nov 11 '22 22:11

Marat Talipov

x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
x[is.na(x)]=2
x.rle=rle(x)
val=x.rle$v
if (val[1]==2) val[1]=0
ind = (val==2)
val[ind]=val[which(ind)-1]
rep(val,x.rle$l)

Output:

[1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0

answered Nov 11 '22 20:11

user31264

Pasting the sequence into a string and using a while loop that checks (with grep) whether there are any NAs preceded by 1s and substitutes (with gsub) such cases with a 1 will do it:

# substitute NA for "N" for later ease of processing and locating 1s by position
x[is.na(x)] <- "N"
# Collapse vector into a string
stringx <- paste(x, collapse = "")

while(grepl("(?<=1)N", stringx, perl = TRUE)) {
    stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE)
}

Then you can use gregexpr to get the indices of 1s.

unlist(gregexpr("1", stringx))
#[1]  6  8  9 10 11 15 16 17

Or you can split the string and look through to find the indices of 1s in the resulting vector:

newx <-unlist(strsplit(stringx, ""))
#[1] "N" "N" "N" "N" "0" "1" "0" "1" "1" "1" "1" "0" "N" "N" "1" "1" "1" "0"

which(newx == "1") 
#[1]  6  8  9 10 11 15 16 17

Using stri_flatten from the stringi package instead of paste and stri_locate_all_fixed rather than gregexpr or a string splitting route can provide a little bit more performance if it's a larger vector you're processing. If the vector isn't large, no performance gains will result.

library(stringi)
x[is.na(x)] <- "N"
stringx <- stri_flatten(x)

while(grepl("(?<=1)N", stringx, perl = TRUE)) {
    stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE) 
}

stri_locate_all_fixed(stringx, "1")[[1]][,"start"]

The following approach is fairly straightforward and performs relatively well (based on bgoldst's excellent benchmarking example) on small and large samples (very well on bgoldst's high probability of NA example)

x[is.na(x)] <- "N"
stringx <- stri_flatten(x)

ones <- stri_locate_all_regex(stringx, "1N*")[[1]]

#[[1]]
#
#     start end
#[1,]     6   6
#[2,]     8  11
#[3,]    15  17

unlist(lapply(seq_along(ones[, 1]),
    function(ii) seq.int(ones[ii, "start"], ones[ii, "end"])))
#[1]  6  8  9 10 11 15 16 17

answered Nov 11 '22 21:11

Jota

Related questions
                            
                                How to create a discrete normal distribution in R?
                            
                                Including ASCII art in R
                            
                                Change point colors and color of frame/ellipse around points
                            
                                What is the difference the zoo object and ts object in R?
                            
                                ggplot legend: position of key relative to labels
                            
                                R: How to get a sum of two distributions?
                            
                                How to upload an image into RStudio Notebook?
                            
                                Add leading 0 with gsub
                            
                                Read data from a multi separated csv file in R
                            
                                Get the longest element of a list
                            
                                Probability of the Union of Three or More Sets
                            
                                ggplot2 + plotly : Axis title disappear
                            
                                Merge two data frames to get alternate rows of each data frame in sequence
                            
                                Specify download folder in RSelenium
                            
                                Plot a simple conversion funnel in ggplot
                            
                                How do I count the number of unique vectors in a list?
                            
                                Why is using list() critical for .dots = setNames() uses in dplyr?
                            
                                R - Print table with columns sums below
                            
                                Make all elemants of a character vector the same length
                            
                                How to identify only "not duplicated" rows

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With