I have a vector with repeating patterns within it. I want to break any where the repeating pattern of n length changes. Here's the data:
x <- c(rep(1:4, 5), rep(5:6, 3), rep(c(1, 4, 7), 5), rep(c(1, 5, 7), 1), rep(2:4, 3))
## [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 5 6 5 6 5 6 1 4 7 1 4 7 1 4 7 1 4 7 1 4 7 1 5 7 2 3 4 2 3 4 2 3 4
I want to be able to find those places the pattern changes so it breaks like this:
I think rle
may be of use but don't see how.
break terminates the execution of a for or while loop. Statements in the loop after the break statement do not execute. In nested loops, break exits only from the loop in which it occurs.
Using break in a nested loop In a nested loop, a break statement only stops the loop it is placed in. Therefore, if a break is placed in the inner loop, the outer loop still continues. However, if the break is placed in the outer loop, all of the looping stops.
To break out of a while loop, you can use the endloop, continue, resume, or return statement. endwhile; If the name is empty, the other statements are not executed in that pass through the loop, and the entire loop is closed.
Here's a function to do it. By the way, this is a problem in genetics - finding tandem repeats. Here's a link to an algorithm paper that is a much better treatment than this, but much more complicated to implement.
The output is a vector of groups to split x into.
First a helper function:
factorise <- function(x) {
x <- length(x)
if(x == 1){return(1)}
todivide <- seq(from = 2, to = x)
out <- todivide[x %% todivide == 0L]
return(out)
}
Now the main function:
findreps <- function(x, counter = NULL){
if(is.null(counter)){
counter <- c()
maxcounter <- 0
} else {
maxcounter <- max(counter)
}
holding <- lapply(1:length(x), function(y){x[1:y]})
factors <- lapply(holding, factorise)
repeats <- sapply(1:length(factors), function(index) {any(sapply(1:length(factors[[index]]), function(zz) {all((rep(holding[[index]][1:(length(holding[[index]])/factors[[index]][zz])], factors[[index]][zz]))==holding[[index]])}))})
holding <- holding[max(which(repeats))][[1]]
if(length(holding) == length(x)){
return(c(counter, rep(maxcounter + 1, length(x))))
} else {
counter <- c(counter, rep(maxcounter + 1, length(holding)))
return(findreps(x[(length(holding) + 1):length(x)], counter))
}
}
How it works: It's a recursive function that runs, cuts off the biggest repeats group it can find from the start of the vector, and then runs until they are all gone.
First we make a counter
for the final output.
Next, we split x
into each subset starting from 1 into a list, holding
.
Then we find all factors of the size of a group, except 1.
Then is the worst part. We take each subset of the biggest subset, and check if it is equal to the biggest subset in its group after being repeated the sensible amount of times.
findreps(x)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[37] 3 3 3 3 3 4 5 6 7 7 7 7 7 7 7 7 7
If you want non-repeats to be grouped, we can use a little dplyr
and tidyr
:
library(dplyr)
library(tidyr)
z <- data.frame(x = x, y = findreps(x))
z %>% mutate(y = ifelse(duplicated(y) | rev(duplicated(rev(y))), y, NA),
holding = c(0, y[2:n()])) %>%
fill(holding) %>%
mutate(y = ifelse(is.na(y), holding +1, y)) %>%
select(-holding)
Which gives:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 7 7 7 7 7 7 7 7
[53] 7
I am almost there, but I doesn't work for the full 100% and it is getting late (zzz). First the code:
x <-c(rep(1:4, 5), rep(5:6, 3), rep(c(1, 4, 7), 5), rep(c(1, 5, 7), 1), rep(2:4, 3))
#The first break must be position 1
Xbreaklist <- 1
#We need a counter, a duplicate dataset
counter <- 0
xx <- x
while (length(xx) > 0) {
#first we extract a pattern by looking for the first repeated number
Xpattern <- xx[1:(min(which(stri_duplicated(xx) == TRUE))-1)]
#then we convert the vector and the pattern into a string
XpatternS <- paste0(Xpattern, collapse="")
xxS <- paste0(xx, collapse="")
#then we extract all patterns and count them, multiply by length and add 1
Xbreak <- 1 + (length(unlist(stri_extract_all_coll(xxS, XpatternS))) * length(Xpattern))
#break here if we reached the end
if (Xbreak >= length(xx)) break
# We add that to the list of breaks
counter <- counter + Xbreak
Xbreaklist <- c(Xbreaklist, counter)
# then we remove the part of the list we're done with
xx <- xx[(Xbreak):length(xx)]
}
Xbreaklist
[1] 1 21 28 44 51
What is wrong with it? Two things:
1 A pattern that is not repeated takes the first occurrence of the next pattern with it: "121212 56 787878" gets split as ("121212 5678 7878")
2 Repeating patterns ("1212 5656 12 134") mess things up because stri_extract_all_coll
takes them all out and hence length is to long.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With