Extract last numbers of sequences in a vector in R

Tags:

r

This is my vector:

myvector<-c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 149L, 150L)

As you can see, there are some sequences of consecutive integers inside this vector:

Seq1: 1,2
Seq2: 4,5,6,7,8
Seq3: 10
Seq4: 12
Seq5: 142,143
Seq6: 149,150

I'm trying to write code that identifies these sequences and extracts the last element of each one. The result should be:

output<- c(2L, 8L,10L,12L, 143L, 150L)

I have other, bigger vectors, but if I can do this with this vector I will be able to do it with the others.

I tried to use diff, but the last element gets dropped.

Any help guys?

Laura asked Mar 04 '23 12:03


2 Answers

We can create a grouping vector with diff and cumsum, and use that in tapply to extract the last element of each group.

unname(tapply(myvector, cumsum(c(TRUE, diff(myvector) != 1)), 
      FUN = tail, 1))
#[1]   2   8  10  12 143 150
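To see why this grouping works, it can help to print the intermediate vectors. A minimal sketch using only base R; the names `breaks`, `grp` and `last_vals` are illustrative and not part of the answer:

```r
myvector <- c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 149L, 150L)

# TRUE marks the start of a new run: the first element always starts one,
# and so does any element that is not exactly 1 greater than its predecessor.
breaks <- c(TRUE, diff(myvector) != 1)

# cumsum() turns the run starts into a run id for every element.
grp <- cumsum(breaks)
grp
# [1] 1 1 2 2 2 2 2 3 4 5 5 6 6

# Last element of each run.
last_vals <- unname(tapply(myvector, grp, FUN = tail, 1))
last_vals
# [1]   2   8  10  12 143 150
```

Prepending TRUE is what keeps the grouping vector the same length as myvector, which avoids the "last element is deleted" problem of using diff on its own.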

Or another simple option is

by(myvector, cumsum(c(TRUE, diff(myvector) != 1)), FUN = tail, 1)

Another option is to split the vector into a list and extract the last element of each piece by looping through the list

lst1 <- split(myvector, cumsum(c(TRUE, diff(myvector) != 1)))
unname(sapply(lst1, tail, 1))
#[1]   2   8  10  12 143 150

Or create a grouping column in a data.frame/tibble and then do a regular slice/filter

library(tidyverse)
tibble(val = myvector, grp = cumsum(c(TRUE, diff(val) != 1))) %>% 
      group_by(grp) %>%          
      slice(n()) %>% 
      pull(val)
#[1]   2   8  10  12 143 150
akrun answered Mar 19 '23 23:03


Here is another solution just with subsetting

myvector<-c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 149L, 150L)

myvector[which(diff(myvector) == 1)[!diff(which(diff(myvector, lag=1) == 1) + 1) == 1] + 1]

Explanation

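  1. Identify sequences

which(diff(myvector) == 1)

[1] 1 3 4 5 6 10 12

  2. Identify ends of sequences

!diff(which(diff(myvector, lag=1) == 1) + 1) == 1

This is a logical mask. Subsetting the index vector from step 1 with it gives

[1] 1 6 10 12

Note that the mask is one element shorter than the index vector, so R recycles it. The final index (12) is kept here only because the first element of the mask happens to be TRUE.

  3. Fix the index

+1

[1] 2 7 11 13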

These are the indices for the last elements of sequences! :)
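The mask in step 2 has one element fewer than the index vector from step 1, so the subsetting depends on R recycling the mask. Below is a hedged sketch of a variant that marks the final index explicitly. `run_ends` is a hypothetical helper name, not part of the answer, and like the one-liner it only reports runs of at least two consecutive values:

```r
# Hypothetical helper: indices of the last element of every run of
# consecutive integers that has length >= 2.
run_ends <- function(x) {
  seqs <- which(diff(x) == 1)            # positions whose successor is x + 1
  if (length(seqs) == 0L) return(integer(0))
  # A position ends a run when the next position in seqs is not adjacent;
  # the last position always ends a run, so append TRUE explicitly.
  is_end <- c(diff(seqs) != 1, TRUE)
  seqs[is_end] + 1L
}

myvector <- c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 149L, 150L)
run_ends(myvector)
# [1]  2  7 11 13
myvector[run_ends(myvector)]
# [1]   2   8 143 150

# A case where the first element of the mask would be FALSE.
x <- c(1L, 2L, 3L, 5L, 6L)
x[run_ends(x)]
# [1] 3 6
```

With the recycled mask, the second example would return only 3, because the recycled value for the last index is FALSE.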

Optimization

Store the result of which(diff(myvector) == 1) so that it is computed only once instead of twice

seqs <- which(diff(myvector) == 1)
myvector[seqs[!diff(seqs + 1) == 1] + 1]

microbenchmark::microbenchmark({seqs <- which(diff(myvector) == 1)
myvector[seqs[!diff(seqs + 1) == 1] + 1]})
Unit: microseconds
                                                                                    expr
 {     seqs <- which(diff(myvector) == 1)     myvector[seqs[!diff(seqs + 1) == 1] + 1] }

   min      lq    mean median      uq    max neval
11.773 12.3345 13.2772 12.473 12.7435 68.969   100

microbenchmark::microbenchmark({myvector[which(diff(myvector) == 1)[!diff(which(diff(myvector, lag=1) == 1) + 1) == 1] + 1]})
Unit: microseconds
                                                                                                           expr
 {     myvector[which(diff(myvector) == 1)[!diff(which(diff(myvector,          lag = 1) == 1) + 1) == 1] + 1] }
    min     lq     mean  median     uq    max neval
 17.721 18.295 19.44263 18.5855 18.926 82.875   100

Solution including single values

This is even simpler, since we do not have to check whether a value is part of a sequence or not. We keep every element whose next value breaks the "sequence", that is, whose successor is not exactly 1 larger. The final value is included in any case: either it ends a sequence or it is a single value, and either way no further consecutive integer follows it.

myvector<-c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 149L, 150L)
# Test with different vector
myvector2<-c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 148L, 150L)

lastSeq <- function(vector){
   return(vector[c(which(diff(vector) != 1), length(vector))] )
}
lastSeq(myvector)
lastSeq(myvector2)
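As a quick check, the function can be compared against the output requested in the question. This sketch redefines `lastSeq` from above so that the block runs on its own; the single-element input is an extra test case assumed here, not taken from the answer:

```r
lastSeq <- function(vector) {
  vector[c(which(diff(vector) != 1), length(vector))]
}

myvector  <- c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 149L, 150L)
myvector2 <- c(1L, 2L, 4L, 5L, 6L, 7L, 8L, 10L, 12L, 142L, 143L, 148L, 150L)

lastSeq(myvector)
# [1]   2   8  10  12 143 150
lastSeq(myvector2)
# [1]   2   8  10  12 143 148 150

# Matches the output asked for in the question.
identical(lastSeq(myvector), c(2L, 8L, 10L, 12L, 143L, 150L))
# [1] TRUE

# A single value is its own run.
lastSeq(7L)
# [1] 7
```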
Daniel Winkler answered Mar 19 '23 21:03