Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate multiple permutations of vector with non-repeating elements

Tags:

r

I have a vector:

seq1<-c('a','b','c','b','a','b','c','b','a','b','c')

I wish to permute the elements of this vector to create multiple (ideally up to 5000) vectors with the condition that the permuted vectors cannot have repeated elements within the vector in consecutive elements. e.g. "abbca...." is not allowed as 'b-b' is a repeat.

I realize that for this small example there probably are not 5000 solutions. I am typically dealing with much larger vectors. I am also willing to consider sampling with replacement, though currently I'm working on solutions without replacement.

I am looking for better solutions than my current thinking.

Option 1. - brute force.

Here, I just repeatedly sample and check if any successive elements are duplicates.

set.seed(18)
seq1b <-  sample(seq1a)
seq1b
#[1] "b" "b" "a" "a" "c" "b" "b" "c" "a" "c" "b"
sum(seq1b[-length(seq1b)]==seq1b[-1])  #3

This is not a solution as there are 3 duplicated consecutive elements. I also realize that lag is probably a better way to check for duplicated elements but for some reason it is being finicky (I think it is being masked by another package I have loaded).

set.seed(1000)
res<-NULL
for (i in 1:10000){res[[i]]<-sample(seq1a)}
res1 <- lapply(res, function(x) sum(x[-length(x)]==x[-1]))
sum(unlist(res1)==0) #228

This produces 228 options out of 10000 iterations. But let's see how many unique ones:

res2 <- res[which(unlist(res1)==0)]
unique(unlist(lapply(res2, paste0, collapse="")))  #134

Out of 10000 attempts we only get 134 unique ones from this short example vector.

Here are 3 of the 134 example sequences produced:

# "bcbabcbabca" "cbabababcbc" "bcbcababacb"

In fact, if I try over 500,000 samples, I can only get 212 unique sequences that match my non-repeating criteria. This is probably close to the upper limit of possible ones.

Option 2. - iteratively

A second idea I had is to be more iterative about the approach.

seq1a
table(seq1a)
#a b c 
#3 5 3

We could sample one of these letters as our starting point. Then sample another from the remaining ones, check if it is the same as the previously chosen one and if not, add it to the end. And so on and so forth...

set.seed(10)
newseq <- sample(seq1a,1)  #b
newseq #[1] "b"

remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)]
table(remaining)
#a b c 
#3 4 3 

set.seed(10)
newone <- sample(remaining,1) #c

#check if newone is same as previous one.
newone==newseq[length(newseq)] #FALSE
newseq <- c(newseq, newone) #update newseq
newseq #[1] "b" "c"

remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)] #update remaining
remaining
table(remaining)

#a b c 
#3 4 2 

This might work, but I can also see it running into lots of issues - e.g. we could go:

# "a" "c" "a" "c" "a" "b"  ...

and then be left with 3 more 'b's that cannot go at the end as they'd be duplicates.

Of course, this would be a lot easier if I allowed sampling with replacement, but for now I'm trying to do this without replacement.

like image 862
jalapic Avatar asked Jun 01 '15 02:06

jalapic


2 Answers

You can use the iterpc package to work with combinations and iterations. I hadn't heard of it until trying to answer this question so there might also be more effective ways to use the same package.

Here I've used iterpc to set up an iterator, and getall to find all combinations of the vector based on that iterator. This seems to just report unique combinations, making it a bit nicer than finding all combinations with expand.grid.

#install.packages("iterpc")
require("iterpc")

seq1 <- c('a','b','c','b','a','b','c','b','a','b','c')

I <- iterpc(n = table(seq1), ordered=TRUE)

all_seqs <- getall(I)

# result is a matrix with permutations as rows:
head(all_seqs)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#[1,] "a"  "a"  "a"  "b"  "b"  "b"  "b"  "b"  "c"  "c"   "c"  
#[2,] "a"  "a"  "a"  "b"  "b"  "b"  "b"  "c"  "b"  "c"   "c"  
#[3,] "a"  "a"  "a"  "b"  "b"  "b"  "b"  "c"  "c"  "b"   "c"  
#[4,] "a"  "a"  "a"  "b"  "b"  "b"  "b"  "c"  "c"  "c"   "b"  
#[5,] "a"  "a"  "a"  "b"  "b"  "b"  "c"  "b"  "b"  "c"   "c"  
#[6,] "a"  "a"  "a"  "b"  "b"  "b"  "c"  "b"  "c"  "b"   "c" 

The rle function tells us about consecutive values equal to each other in a vector. The lengths component of the output tells us how many times each element of values is repeated:

rle(c("a", "a", "b", "b", "b", "c", "b"))

# Run Length Encoding
#   lengths: int [1:3] 2 3 1 1
#   values : chr [1:3] "a" "b" "c" "b"

The length of values or lengths will be equal to the length of the original vector only for combinations which have no consecutive repeats.

You can therefore apply rle to each row, calculate the length of values or lengths and keep rows from all_seqs where the calculated value is the same as the length of seqs1.

#apply the rle function 
all_seqs_rle <- apply(getall(I), 1, function(x) length(rle(x)$values))

# keep rows which have an rle with a length equal to length(seq1)
all_seqs_good <- all_seqs[which(all_seqs_rle == length(seq1)), ]

all_seqs_good has an nrow of 212, suggesting that you did indeed find all possible combinations for your example vector.

nrow(all_seqs_good)
# 212 

Technically this is still brute forcing (except that it doesn't calculate every possible combination - only unique ones), but is fairly quick for your example. I'm not sure how well it will cope with larger vectors yet...

Edit: this does seem to fail for larger vectors. One solution would be to break larger vectors into smaller chunks, then process those chunks as above and combine them - keeping only the combinations which meet your criteria.

For example, breaking a vector of length 24 into two vectors of length 12, then combining the results can give you 200,000+ combinations which meet your critera and is pretty quick (around 1 minute for me):

# function based on the above solution
seq_check <- function(mySeq){
I = iterpc(n = table(mySeq), ordered=TRUE)
all_seqs <- getall(I)
all_seqs_rle <- apply(getall(I), 1, function(x) length(rle(x)$values))
all_seqs_good <- all_seqs[which(all_seqs_rle == length(mySeq)), ]
return(all_seqs_good)
}

set.seed(1)
seq1<-sample(c(rep("a", 8), rep("b", 8), rep("c", 8)),24)

seq1a <- seq1[1:12]
seq1b <- seq1[13:24]

#get all permutations with no consecutive repeats
seq1a = apply(seq_check(seq1a), 1, paste0, collapse="")
seq1b = apply(seq_check(seq1b), 1, paste0, collapse="")

#combine seq1a and seq1b: 
combined_seqs <- expand.grid(seq1a, seq1b)
combined_seqs <- apply(combined_seqs, 1, paste0, collapse="") 

#function to calculate rle lengths
rle_calc <- function(x) length(rle(unlist(strsplit(x, "")))$values)

#keep combined sequences which have rle lengths of 24
combined_seqs_rle <- sapply(combined_seqs, rle_calc)
passed_combinations <- combined_seqs[which(combined_seqs_rle == 24)]

#find number of solutions
length(passed_combinations)
#[1] 245832
length(unique(passed_combinations))
#[1] 245832

You might need to re-order the starting vector for best results. For example, if seq1 in the above example had started with "a" eight times in a row, there would be no passing solutions. For example, try the splitting up solution with seq1 <- c(rep("a", 8), rep("b", 8), rep("c", 8)) and you get no solutions back, even though there are really the same number of solutions for the random sequence.

It doesn't look like you need to find every possible passing combination, but if you do then for larger vectors you'll probably need to iterate through I using the getnext function from iterpc, and check each one in a loop which would be very slow.

like image 100
ping Avatar answered Oct 25 '22 02:10

ping


Here another solution. Please see the comments in the code for an explanation of the algorithm. In a way, it's similar to your second (iterative) approach, but it includes

  1. a while loop that ensures that the next element is valid
  2. and a stopping criterion for the case when the remaining elements would necessarily form an invalid combination

The algorithm is also quite efficient with longer seq1 vectors as given in one of your comments. But I guess it's performance will degrade if you have more unique elements in seq1.

Here the code: First a few definitions

set.seed(1234)
seq1=c('a','b','c','b','a','b','c','b','a','b','c')

#number of attempts to generate a valid combination
Nres=10000

#this list will hold the results
#we do not have to care about memory allocation
res_list=list()

Now generate the combinations

#the outer loop creates the user-defined number of combination attempts
for (i in 1:Nres) {
  #create a "population" from seq1
  popul=seq1
  #pre-allocate an NA vector of the same length as seq1
  res_vec=rep(NA_character_,length(seq1))
  #take FIRST draw from the population
  new_draw=sample(popul,1)
  #remove draw from population
  popul=popul[-match(new_draw,popul)]
  #save new draw
  res_vec[1]=new_draw

  #now take remaining draws
  for (j in 2:length(seq1)) {
    #take new draws as long as
    #1) new_draw is equal to the last draw and
    #2) as long as there are any valid elements left in popul
    while((new_draw==res_vec[j-1])&any(res_vec[j-1]!=popul)) {
      #take new draw
      new_draw=sample(popul,1)
    }
    #if we did not find a valid draw break inner loop
    if (new_draw==res_vec[j-1]) {
      break
    }
    #otherwise save new_draw ...
    res_vec[j]=new_draw
    #... and delete new draw from population
    popul=popul[-match(new_draw,popul)]
  }
  #this is to check whether we had to break the inner loop
  #if not, save results vector
  if (sum(is.na(res_vec[j]))==0) res_list[[length(res_list)+1]]=res_vec
}

Now let's check the results

#for each result vector in res_list:
#1) check whether all subsequent elements are different ---> sum(x[-1]==x[-length(x)])==0
#2) and whether we have the same number of elements as in seq1 ---> all.equal(table(x),table(seq1),check.attributes=FALSE)
sum(sapply(res_list,function(x) (sum(x[-1]==x[-length(x)])==0)&all.equal(table(x),table(seq1),check.attributes=FALSE)))
#6085

#the previous number should be the same as the length of res_list
length(res_list)
#6085

#check the number of unique solutions
length(unique(res_list))
#212
like image 42
cryo111 Avatar answered Oct 25 '22 02:10

cryo111