Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find all subsequences with specific length in sequence of numbers in R

Tags:

r

sequence

I want to find all subsequences within a sequence with (minimum) length of n. Lets assume I have this sequence

sequence <- c(1,2,3,2,5,3,2,6,7,9)

and I want to find the increasing subsequences with minimum length of 3. The ouput should be a dataframe with start and end position for each subsequence found.

df =data.frame(c(1,7),c(3,10))
colnames(df) <- c("start", "end")

Can somebody give a hint how to solve my problem?

Thanks in advance!

like image 875
MatthiasHab1986 Avatar asked Apr 04 '19 12:04

MatthiasHab1986


2 Answers

One way using only base R

n <- 3

do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)), 
        function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))

#  start end
#1    1    3
#4    7   10

split the index of sequence based on the continuous incremental subsequences, if the length of each group is greater than equal to n return the start and end index of that group.


To understand lets break this down and understand it step by step

Using diff we can find difference between consecutive elements

diff(sequence)
#[1]  0  1  1 -1  3 -2 -1  4  1  2

We check which of them do not have increasing subsequences

diff(sequence) < 1
#[1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE

and take cumulative sum over them to create groups

cumsum(c(0, diff(sequence)) < 1)
#[1] 1 1 1 2 2 3 4 4 4 4

Based on this groups, we split the index from 1:length(sequence)

split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1))
#$`1`
#[1] 1 2 3

#$`2`
#[1] 4 5

#$`3`
#[1] 6

#$`4`
#[1]  7  8  9 10

Using sapply we loop over this list and return the start and end index of the list if the length of the list is >= n (3 in this case)

sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)), 
       function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)]))

#$`1`
#start   end 
#    1     3 

#$`2`
# NULL

#$`3`
#NULL

#$`4`
#start   end 
#    7    10 

Finally, rbind all of them together using do.call. NULL elements are automatically ignored.

do.call(rbind, sapply(split(1:length(sequence), cumsum(c(0, diff(sequence)) < 1)), 
       function(x) if (length(x) >= n) c(start = x[1], end = x[length(x)])))

#  start end
#1     1   3
#4     7  10
like image 97
Ronak Shah Avatar answered Oct 13 '22 01:10

Ronak Shah


Here is another solution using base R. I tried to comment it well but it may still be hard to follow. It seems like you wanted direction / to learn, more than an outright answer so definitely follow up with questions if anything is unclear (or doesn't work for your actual application).

Also, for your data, I added a 12 on the end to make sure it was returning the correct position for repeated increases greater than n (3 in this case):

# Data (I added 11 on the end)
sequence <- c(1,2,3,2,5,3,2,6,7,9, 12)

# Create indices for whether or not the numbers in the sequence increased
indices <- c(1, diff(sequence) >= 1)
indices
[1] 1 1 1 0 1 0 0 1 1 1 1

Now that we have the indices, we need to get the start and end postions for repeates >= 3

# Finding increasing sequences of n length using rle
n <- 3
n <- n - 1

# Examples 
rle(indices)$lengths
[1] 3 1 1 2 4

rle(indices)$values
[1] 1 0 1 0 1

# Finding repeated TRUE (1) in our indices vector
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
reps
[1]  TRUE FALSE FALSE FALSE  TRUE

# Creating a vector of positions for the end of a sequence
# Because our indices are true false, we can use cumsum along
# with rle to create the positions of the end of the sequences
rle_positions <- cumsum(rle(indices)$lengths)
rle_positions
[1]  3  4  5  7 11

# Creating start sequence vector and subsetting start / end using reps
start <- c(1, head(rle_positions, -1))[reps]

end <- rle_positions[reps]

data.frame(start, end)
  start end
1     1   3
2     7  11

Or, concisely:

n <- 3
n <- n-1
indices <- c(1, diff(sequence) >= 1)
reps <- rle(indices)$lengths >= n & rle(indices)$values == 1
rle_positions <- cumsum(rle(indices)$lengths)
data.frame(start = c(1, head(rle_positions, -1))[reps], 
           end = rle_positions[reps])
  start end
1     1   3
2     7  11

EDIT: @Ronak's update made me realize I should be using diff instead of sapply with an anonymous function for my first step. Updated the answer b/c it was not catching an increase at the end of the vector (e.g., sequence <- c(1,2,3,2,5,3,2,6,7,9,12, 11, 11, 20, 100), also needed to add one more line under n <- 3. This should work as intended now.

like image 28
Andrew Avatar answered Oct 13 '22 00:10

Andrew