Find and break on repeated runs

Tags:

r

I have a vector with repeating patterns within it. I want to break any where the repeating pattern of n length changes. Here's the data:

x <- c(rep(1:4, 5), rep(5:6, 3), rep(c(1, 4, 7), 5), rep(c(1, 5, 7), 1), rep(2:4, 3))

##  [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 5 6 5 6 5 6 1 4 7 1 4 7 1 4 7 1 4 7 1 4 7 1 5 7 2 3 4 2 3 4 2 3 4

I want to be able to find those places the pattern changes so it breaks like this:

enter image description here

I think rle may be of use but don't see how.

233

asked Oct 15 '15 18:10

2 Answers

Here's a function to do it. By the way, this is a problem in genetics - finding tandem repeats. Here's a link to an algorithm paper that is a much better treatment than this, but much more complicated to implement.

The output is a vector of groups to split x into.

First a helper function:

factorise <- function(x) {
  x <- length(x)
  if(x == 1){return(1)}
  todivide <- seq(from = 2, to = x)
  out <- todivide[x %% todivide == 0L]
  return(out)
}

Now the main function:

findreps <- function(x, counter = NULL){
  if(is.null(counter)){
    counter <- c()
    maxcounter <- 0
  } else {
    maxcounter <- max(counter)
  }
  holding <- lapply(1:length(x), function(y){x[1:y]})
  factors <- lapply(holding, factorise)
  repeats <- sapply(1:length(factors), function(index) {any(sapply(1:length(factors[[index]]), function(zz) {all((rep(holding[[index]][1:(length(holding[[index]])/factors[[index]][zz])], factors[[index]][zz]))==holding[[index]])}))})
  holding <- holding[max(which(repeats))][[1]]
  if(length(holding) == length(x)){
    return(c(counter, rep(maxcounter + 1, length(x))))
  } else {
    counter <- c(counter, rep(maxcounter + 1, length(holding)))
    return(findreps(x[(length(holding) + 1):length(x)], counter))
  }
}

How it works: It's a recursive function that runs, cuts off the biggest repeats group it can find from the start of the vector, and then runs until they are all gone.

First we make a counter for the final output.

Next, we split x into each subset starting from 1 into a list, holding.

Then we find all factors of the size of a group, except 1.

Then is the worst part. We take each subset of the biggest subset, and check if it is equal to the biggest subset in its group after being repeated the sensible amount of times.

findreps(x)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[37] 3 3 3 3 3 4 5 6 7 7 7 7 7 7 7 7 7

If you want non-repeats to be grouped, we can use a little dplyr and tidyr:

library(dplyr)
library(tidyr)

z <- data.frame(x = x, y = findreps(x))

z %>% mutate(y = ifelse(duplicated(y) | rev(duplicated(rev(y))), y, NA),
             holding = c(0, y[2:n()])) %>%
      fill(holding) %>%
      mutate(y = ifelse(is.na(y), holding +1, y)) %>%
      select(-holding)

Which gives:

 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 7 7 7 7 7 7 7 7
[53] 7

188

answered Sep 17 '22 23:09

jeremycg

I am almost there, but I doesn't work for the full 100% and it is getting late (zzz). First the code:

x <-c(rep(1:4, 5), rep(5:6, 3), rep(c(1, 4, 7), 5), rep(c(1, 5, 7), 1), rep(2:4, 3))

#The first break must be position 1
Xbreaklist <- 1

#We need a counter, a duplicate dataset 
counter <- 0
xx <- x

while (length(xx) > 0) {
#first we extract a pattern by looking for the first repeated number
Xpattern <- xx[1:(min(which(stri_duplicated(xx) == TRUE))-1)]

#then we convert the vector and the pattern into a string
XpatternS <- paste0(Xpattern, collapse="")
xxS <- paste0(xx, collapse="")

#then we extract all patterns and count them, multiply by length and add 1 
Xbreak <- 1 + (length(unlist(stri_extract_all_coll(xxS, XpatternS))) * length(Xpattern))

#break here if we reached the end 
if (Xbreak >= length(xx)) break

# We add that to the list of breaks
counter <- counter + Xbreak
Xbreaklist <- c(Xbreaklist, counter)

# then we remove the part of the list we're done with
xx <- xx[(Xbreak):length(xx)]
}

Xbreaklist
[1]  1 21 28 44 51

What is wrong with it? Two things:
1 A pattern that is not repeated takes the first occurrence of the next pattern with it: "121212 56 787878" gets split as ("121212 5678 7878")
2 Repeating patterns ("1212 5656 12 134") mess things up because stri_extract_all_coll takes them all out and hence length is to long.

answered Sep 18 '22 23:09

RHA

Related questions
                            
                                Usage of Dot / Period in R Functions
                            
                                Gdata package perl issue
                            
                                R's data.table Truncating Bits?
                            
                                How do I close unused connections after read_html in R
                            
                                Fastest way to parse a date-time string to class Date
                            
                                Building a package with devtools - throwing an error where "Author" and "Maintainer" fields are missing/empty despite being filled
                            
                                Using "..." and "replicate"
                            
                                An error ['\+' is an unrecognized escape in character string starting "\+" while creating a R package
                            
                                How to write a function that calls a function that calls data.table?
                            
                                specify dplyr column names [duplicate]
                            
                                Why does facet_grid work, but not facet_wrap?
                            
                                Why is a length one vector initially at NAM(2)?
                            
                                Where to put external files for testthat tests
                            
                                How achieve identical facet sizes and scales in several multi-facet ggplot2 graphics?
                            
                                Parse Error: "Trailing Garbage" while trying to parse JSON column in data frame
                            
                                dplyr rename - Error: `new_name` = old_name must be a symbol or a string, not formula
                            
                                ggplot2: raster plotting does not work as expected when setting alpha values
                            
                                Want only the time portion of a date-time object in R
                            
                                Passing missing argument from function to function in R
                            
                                Speed up lmer function in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Find and break on repeated runs

Tags:

r

Tyler Rinker

People also ask

2 Answers

jeremycg

RHA

Recent Activity

Donate For Us