Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find and break on repeated runs

Tags:

r

I have a vector with repeating patterns within it. I want to break any where the repeating pattern of n length changes. Here's the data:

x <- c(rep(1:4, 5), rep(5:6, 3), rep(c(1, 4, 7), 5), rep(c(1, 5, 7), 1), rep(2:4, 3))

##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 5 6 5 6 5 6 1 4 7 1 4 7 1 4 7 1 4 7 1 4 7 1 5 7 2 3 4 2 3 4 2 3 4

I want to be able to find those places the pattern changes so it breaks like this:

enter image description here

I think rle may be of use but don't see how.

like image 233
Tyler Rinker Avatar asked Oct 15 '15 18:10

Tyler Rinker


People also ask

Does Break work with for loop?

break terminates the execution of a for or while loop. Statements in the loop after the break statement do not execute. In nested loops, break exits only from the loop in which it occurs.

How does break work in multiple loops?

Using break in a nested loop In a nested loop, a break statement only stops the loop it is placed in. Therefore, if a break is placed in the inner loop, the outer loop still continues. However, if the break is placed in the outer loop, all of the looping stops.

How do you break a while loop?

To break out of a while loop, you can use the endloop, continue, resume, or return statement. endwhile; If the name is empty, the other statements are not executed in that pass through the loop, and the entire loop is closed.


2 Answers

Here's a function to do it. By the way, this is a problem in genetics - finding tandem repeats. Here's a link to an algorithm paper that is a much better treatment than this, but much more complicated to implement.

The output is a vector of groups to split x into.

First a helper function:

factorise <- function(x) {
  x <- length(x)
  if(x == 1){return(1)}
  todivide <- seq(from = 2, to = x)
  out <- todivide[x %% todivide == 0L]
  return(out)
}

Now the main function:

findreps <- function(x, counter = NULL){
  if(is.null(counter)){
    counter <- c()
    maxcounter <- 0
  } else {
    maxcounter <- max(counter)
  }
  holding <- lapply(1:length(x), function(y){x[1:y]})
  factors <- lapply(holding, factorise)
  repeats <- sapply(1:length(factors), function(index) {any(sapply(1:length(factors[[index]]), function(zz) {all((rep(holding[[index]][1:(length(holding[[index]])/factors[[index]][zz])], factors[[index]][zz]))==holding[[index]])}))})
  holding <- holding[max(which(repeats))][[1]]
  if(length(holding) == length(x)){
    return(c(counter, rep(maxcounter + 1, length(x))))
  } else {
    counter <- c(counter, rep(maxcounter + 1, length(holding)))
    return(findreps(x[(length(holding) + 1):length(x)], counter))
  }
}

How it works: It's a recursive function that runs, cuts off the biggest repeats group it can find from the start of the vector, and then runs until they are all gone.

First we make a counter for the final output.

Next, we split x into each subset starting from 1 into a list, holding.

Then we find all factors of the size of a group, except 1.

Then is the worst part. We take each subset of the biggest subset, and check if it is equal to the biggest subset in its group after being repeated the sensible amount of times.

findreps(x)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[37] 3 3 3 3 3 4 5 6 7 7 7 7 7 7 7 7 7

If you want non-repeats to be grouped, we can use a little dplyr and tidyr:

library(dplyr)
library(tidyr)

z <- data.frame(x = x, y = findreps(x))

z %>% mutate(y = ifelse(duplicated(y) | rev(duplicated(rev(y))), y, NA),
             holding = c(0, y[2:n()])) %>%
      fill(holding) %>%
      mutate(y = ifelse(is.na(y), holding +1, y)) %>%
      select(-holding)

Which gives:

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 7 7 7 7 7 7 7 7
[53] 7
like image 188
jeremycg Avatar answered Sep 17 '22 23:09

jeremycg


I am almost there, but I doesn't work for the full 100% and it is getting late (zzz). First the code:

x <-c(rep(1:4, 5), rep(5:6, 3), rep(c(1, 4, 7), 5), rep(c(1, 5, 7), 1), rep(2:4, 3))

#The first break must be position 1
Xbreaklist <- 1

#We need a counter, a duplicate dataset 
counter <- 0
xx <- x

while (length(xx) > 0) {
#first we extract a pattern by looking for the first repeated number
Xpattern <- xx[1:(min(which(stri_duplicated(xx) == TRUE))-1)]

#then we convert the vector and the pattern into a string
XpatternS <- paste0(Xpattern, collapse="")
xxS <- paste0(xx, collapse="")

#then we extract all patterns and count them, multiply by length and add 1 
Xbreak <- 1 + (length(unlist(stri_extract_all_coll(xxS, XpatternS))) * length(Xpattern))

#break here if we reached the end 
if (Xbreak >= length(xx)) break

# We add that to the list of breaks
counter <- counter + Xbreak
Xbreaklist <- c(Xbreaklist, counter)

# then we remove the part of the list we're done with
xx <- xx[(Xbreak):length(xx)]
}

Xbreaklist
[1]  1 21 28 44 51

What is wrong with it? Two things:
1 A pattern that is not repeated takes the first occurrence of the next pattern with it: "121212 56 787878" gets split as ("121212 5678 7878")
2 Repeating patterns ("1212 5656 12 134") mess things up because stri_extract_all_coll takes them all out and hence length is to long.

like image 42
RHA Avatar answered Sep 18 '22 23:09

RHA