I have a vector:
seq1<-c('a','b','c','b','a','b','c','b','a','b','c')
I wish to permute the elements of this vector to create multiple (ideally up to 5000) vectors with the condition that the permuted vectors cannot have repeated elements within the vector in consecutive elements. e.g. "abbca...." is not allowed as 'b-b' is a repeat.
I realize that for this small example there probably are not 5000 solutions. I am typically dealing with much larger vectors. I am also willing to consider sampling with replacement, though currently I'm working on solutions without replacement.
I am looking for better solutions than my current thinking.
Option 1. - brute force.
Here, I just repeatedly sample and check if any successive elements are duplicates.
set.seed(18)
seq1b <- sample(seq1a)
seq1b
#[1] "b" "b" "a" "a" "c" "b" "b" "c" "a" "c" "b"
sum(seq1b[-length(seq1b)]==seq1b[-1]) #3
This is not a solution as there are 3 duplicated consecutive elements. I also realize that lag
is probably a better way to check for duplicated elements but for some reason it is being finicky (I think it is being masked by another package I have loaded).
set.seed(1000)
res<-NULL
for (i in 1:10000){res[[i]]<-sample(seq1a)}
res1 <- lapply(res, function(x) sum(x[-length(x)]==x[-1]))
sum(unlist(res1)==0) #228
This produces 228 options out of 10000 iterations. But let's see how many unique ones:
res2 <- res[which(unlist(res1)==0)]
unique(unlist(lapply(res2, paste0, collapse=""))) #134
Out of 10000 attempts we only get 134 unique ones from this short example vector.
Here are 3 of the 134 example sequences produced:
# "bcbabcbabca" "cbabababcbc" "bcbcababacb"
In fact, if I try over 500,000 samples, I can only get 212 unique sequences that match my non-repeating criteria. This is probably close to the upper limit of possible ones.
Option 2. - iteratively
A second idea I had is to be more iterative about the approach.
seq1a
table(seq1a)
#a b c
#3 5 3
We could sample one of these letters as our starting point. Then sample another from the remaining ones, check if it is the same as the previously chosen one and if not, add it to the end. And so on and so forth...
set.seed(10)
newseq <- sample(seq1a,1) #b
newseq #[1] "b"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)]
table(remaining)
#a b c
#3 4 3
set.seed(10)
newone <- sample(remaining,1) #c
#check if newone is same as previous one.
newone==newseq[length(newseq)] #FALSE
newseq <- c(newseq, newone) #update newseq
newseq #[1] "b" "c"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)] #update remaining
remaining
table(remaining)
#a b c
#3 4 2
This might work, but I can also see it running into lots of issues - e.g. we could go:
# "a" "c" "a" "c" "a" "b" ...
and then be left with 3 more 'b's that cannot go at the end as they'd be duplicates.
Of course, this would be a lot easier if I allowed sampling with replacement, but for now I'm trying to do this without replacement.
You can use the iterpc
package to work with combinations and iterations. I hadn't heard of it until trying to answer this question so there might also be more effective ways to use the same package.
Here I've used iterpc
to set up an iterator, and getall
to find all combinations of the vector based on that iterator. This seems to just report unique combinations, making it a bit nicer than finding all combinations with expand.grid
.
#install.packages("iterpc")
require("iterpc")
seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')
I <- iterpc(n = table(seq1), ordered=TRUE)
all_seqs <- getall(I)
# result is a matrix with permutations as rows:
head(all_seqs)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#[1,] "a" "a" "a" "b" "b" "b" "b" "b" "c" "c" "c"
#[2,] "a" "a" "a" "b" "b" "b" "b" "c" "b" "c" "c"
#[3,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "b" "c"
#[4,] "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "b"
#[5,] "a" "a" "a" "b" "b" "b" "c" "b" "b" "c" "c"
#[6,] "a" "a" "a" "b" "b" "b" "c" "b" "c" "b" "c"
The rle
function tells us about consecutive values equal to each other in a vector. The lengths
component of the output tells us how many times each element of values
is repeated:
rle(c("a", "a", "b", "b", "b", "c", "b"))
# Run Length Encoding
# lengths: int [1:3] 2 3 1 1
# values : chr [1:3] "a" "b" "c" "b"
The length of values
or lengths
will be equal to the length of the original vector only for combinations which have no consecutive repeats.
You can therefore apply rle
to each row, calculate the length of values
or lengths
and keep rows from all_seqs
where the calculated value is the same as the length of seqs1
.
#apply the rle function
all_seqs_rle <- apply(getall(I), 1, function(x) length(rle(x)$values))
# keep rows which have an rle with a length equal to length(seq1)
all_seqs_good <- all_seqs[which(all_seqs_rle == length(seq1)), ]
all_seqs_good
has an nrow
of 212, suggesting that you did indeed find all possible combinations for your example vector.
nrow(all_seqs_good)
# 212
Technically this is still brute forcing (except that it doesn't calculate every possible combination - only unique ones), but is fairly quick for your example. I'm not sure how well it will cope with larger vectors yet...
Edit: this does seem to fail for larger vectors. One solution would be to break larger vectors into smaller chunks, then process those chunks as above and combine them - keeping only the combinations which meet your criteria.
For example, breaking a vector of length 24 into two vectors of length 12, then combining the results can give you 200,000+ combinations which meet your critera and is pretty quick (around 1 minute for me):
# function based on the above solution
seq_check <- function(mySeq){
I = iterpc(n = table(mySeq), ordered=TRUE)
all_seqs <- getall(I)
all_seqs_rle <- apply(getall(I), 1, function(x) length(rle(x)$values))
all_seqs_good <- all_seqs[which(all_seqs_rle == length(mySeq)), ]
return(all_seqs_good)
}
set.seed(1)
seq1<-sample(c(rep("a", 8), rep("b", 8), rep("c", 8)),24)
seq1a <- seq1[1:12]
seq1b <- seq1[13:24]
#get all permutations with no consecutive repeats
seq1a = apply(seq_check(seq1a), 1, paste0, collapse="")
seq1b = apply(seq_check(seq1b), 1, paste0, collapse="")
#combine seq1a and seq1b:
combined_seqs <- expand.grid(seq1a, seq1b)
combined_seqs <- apply(combined_seqs, 1, paste0, collapse="")
#function to calculate rle lengths
rle_calc <- function(x) length(rle(unlist(strsplit(x, "")))$values)
#keep combined sequences which have rle lengths of 24
combined_seqs_rle <- sapply(combined_seqs, rle_calc)
passed_combinations <- combined_seqs[which(combined_seqs_rle == 24)]
#find number of solutions
length(passed_combinations)
#[1] 245832
length(unique(passed_combinations))
#[1] 245832
You might need to re-order the starting vector for best results. For example, if seq1
in the above example had started with "a" eight times in a row, there would be no passing solutions. For example, try the splitting up solution with seq1 <- c(rep("a", 8), rep("b", 8), rep("c", 8))
and you get no solutions back, even though there are really the same number of solutions for the random sequence.
It doesn't look like you need to find every possible passing combination, but if you do then for larger vectors you'll probably need to iterate through I
using the getnext
function from iterpc
, and check each one in a loop which would be very slow.
Here another solution. Please see the comments in the code for an explanation of the algorithm. In a way, it's similar to your second (iterative) approach, but it includes
while
loop that ensures that the next element is validThe algorithm is also quite efficient with longer seq1
vectors as given in one of your comments. But I guess it's performance will degrade if you have more unique elements in seq1
.
Here the code: First a few definitions
set.seed(1234)
seq1=c('a','b','c','b','a','b','c','b','a','b','c')
#number of attempts to generate a valid combination
Nres=10000
#this list will hold the results
#we do not have to care about memory allocation
res_list=list()
Now generate the combinations
#the outer loop creates the user-defined number of combination attempts
for (i in 1:Nres) {
#create a "population" from seq1
popul=seq1
#pre-allocate an NA vector of the same length as seq1
res_vec=rep(NA_character_,length(seq1))
#take FIRST draw from the population
new_draw=sample(popul,1)
#remove draw from population
popul=popul[-match(new_draw,popul)]
#save new draw
res_vec[1]=new_draw
#now take remaining draws
for (j in 2:length(seq1)) {
#take new draws as long as
#1) new_draw is equal to the last draw and
#2) as long as there are any valid elements left in popul
while((new_draw==res_vec[j-1])&any(res_vec[j-1]!=popul)) {
#take new draw
new_draw=sample(popul,1)
}
#if we did not find a valid draw break inner loop
if (new_draw==res_vec[j-1]) {
break
}
#otherwise save new_draw ...
res_vec[j]=new_draw
#... and delete new draw from population
popul=popul[-match(new_draw,popul)]
}
#this is to check whether we had to break the inner loop
#if not, save results vector
if (sum(is.na(res_vec[j]))==0) res_list[[length(res_list)+1]]=res_vec
}
Now let's check the results
#for each result vector in res_list:
#1) check whether all subsequent elements are different ---> sum(x[-1]==x[-length(x)])==0
#2) and whether we have the same number of elements as in seq1 ---> all.equal(table(x),table(seq1),check.attributes=FALSE)
sum(sapply(res_list,function(x) (sum(x[-1]==x[-length(x)])==0)&all.equal(table(x),table(seq1),check.attributes=FALSE)))
#6085
#the previous number should be the same as the length of res_list
length(res_list)
#6085
#check the number of unique solutions
length(unique(res_list))
#212
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With