Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Shuffle Vector in R, But Identical Elements Should Have Minimum Distance

I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should be between identical elements).

Consider the following example vector in R:

x <- rep(LETTERS[1:5], 3)  # Create example vector
x
#  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"

If I shuffle my vector using the sample function, some of the identical elements may be too close together. For instance, if I use the following R code, the element "C" appears directly after each other at positions 5 and 6:

set.seed(53135)
sample(x)                  # sample() function puts same elements too close
#  [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"

How could I ensure that identical elements have a minimum distance of three?

like image 358
Joachim Schork Avatar asked Nov 16 '20 09:11

Joachim Schork


People also ask

How to shuffle a vector randomly using the sample function?

This Section explains how to shuffle our numeric vector randomly using the sample function. First, we are setting a seed for reproducibility. Then, we are applying the sample function to create a new data object called x_rand.

How many elements are in a vector in RStudio?

The previous output of the RStudio console shows that our first example vector has three vector elements: The characters A, B, and C. Our second example vector also contains three elements.

When to apply the R code with the same vector?

Note that we can apply this R code only in case that both vectors have the same length (or in case the length of one vector is a multiplier of the length of the other vector). This condition does not have to be fulfilled in the next example…

How to test for equality element by element in RStudio?

The RStudio console returns the logical value FALSE, i.e. our two vectors are not identical. This example explains how to test for equality element by element using the == operator. The first and the second element are the same, but the third element is different.


Video Answer


3 Answers

So basically we need to conditionally sample one element from the x vector that have not been selected in the min.dist-1 runs. Using purrr's reduce we can achieve this:

min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"

Bundled in a function

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
 [1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
 [1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
 [1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
 [1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"

after @27ϕ9 comment:

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))

A B C D E 
3 3 3 3 3 
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) : 
  invalid first argument

UPDATE

After some trial and error, looking at the fact that not always you're gonna have enough elements to space out the min.dist I came up with a solution this code is the most explained from the ones above :

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), function(.x, ...){
        # whether the value is in the tail of the aggregated vector
        in.tail <- x %in% tail(.x, min.dist)
        # whether a value still hasn't reached the max frequency
        freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
        # whether a value isn't in the aggregated vector
        yet <- !x %in% .x
        # the if is there basically to account for the cases when we don't have enough vars to space out the vectors
         c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else  x[which(freq.got)[1]] )
    }, .init=sample(x,1))
}

now running the table(shuffle(rep(LETTERS[1:5], 3),2)) will always return 3 for all vars and we can say with some certainty that in the vector the variables are spaced with a minimum distance of 2. the only way to guarantee that no elements are duplicated is by using min.dist=length(unique(x))-1 otherwise there will be instances where at maximum r < min.dist elements are not min.dist distanced from their last occurrences, and if such elements exist they're going to be in the length(x) + 1 - 1:min.dist subset of the resulting vector.

Just to be completely certain using a loop to check whether tail of the output vector has unique values: (remove the print statement I used it just for demonstration purposes)

shuffler <- function(x, min.dist=2){
    while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
    l
}

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
 [1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"

A B C D E 
3 3 3 3 3 

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
 [1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"

A B C D E 
3 3 3 3 3 

Update:

shuffler <- function(x, min.dist=2){
    while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
    l
}

this new version does a rigorous test on whether the elements in the tail of the vector are min.distanced, the previous version works for min.dist=2, however this new version does better testing.

like image 52
Abdessabour Mtk Avatar answered Oct 23 '22 05:10

Abdessabour Mtk


If your data is large, then it may be (way) faster to rely on probability to do that kind of task.

Here's an example:

prob_shuffler = function(x, min.dist = 2){
    n = length(x)
    res = sample(x)
    OK = FALSE
    
    # We loop until we have a solution
    while(!OK){
        OK = TRUE
        for(i in 1:min.dist){
            # We check if identical elements are 'i' steps away
            pblm = res[1:(n-i)] == res[-(1:i)]
            if(any(pblm)){
                if(sum(pblm) >= (n - i)/2){
                    # back to square 1
                    res = sample(x)
                } else {
                    # we pair each identical element with 
                    # an extra one
                    extra = sample(which(!pblm), sum(pblm))
                    id_reshuffle = c(which(pblm), extra)
                    res[id_reshuffle] = sample(res[id_reshuffle])
                }

                # We recheck from the beginning
                OK = FALSE
                break
            }
        }
    }

    res
}

Even though the while loop looks scary, in practice convergence is fast. Of course, the lower the probability to have two characters at min.dist away, the faster the convergence.

The current solutions by @Abdessabour Mtk and @Carles Sans Fuentes work but, depending on the size of the input data, quickly become prohibitively slow. Here's a benchmark:

library(microbenchmark)

x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520

microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#>                 expr       min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 1)    87.001   111.501    155.071   131.801   192.401    264.401    10
#>    shuffler_am(x, 1) 17218.100 18041.900  20324.301 18740.351 22296.301  26495.200    10
#>   shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701    10

microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#>                 expr     min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 2)   140.1   195.201   236.3312   245.252   263.202    354.101    10
#>    shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800  29133.400    10
#>   shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401    10

microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#>                 expr       min        lq        mean     median       uq        max neval
#>  prob_shuffler(x, 3)   318.001   450.402    631.5312    573.352    782.2   1070.401    10
#>    shuffler_am(x, 3) 19003.501 19622.300  23314.4808  20784.551  28281.5  32885.101    10
#>   shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901    10

We can remark two things: a) in all logic, the speed of prob_shuffler depends on min.dist while the other methods not so much, b) prob_shuffler is about 100-fold faster for just 520 observations (and it scales).

Of course if the probability to have two identical characters at min.dist away is extremely high, then the recursive methods should be faster. But in most practical cases, the probability method is faster.

like image 21
Laurent Bergé Avatar answered Oct 23 '22 05:10

Laurent Bergé


I hope this answer works fine for you. It is done with base R, but it works. I leave the printing if you want to check line by line:

x <- rep(LETTERS[1:5], 3)  # Create example vector


shuffle <- function(x, min_dist=3){
  #init variables   
  result<-c() # result vector
  count<-0
  vec_use<-x
  vec_keep<-c()
  for(i in 1:length(x)){
#    print(paste0("iteration =", i))
    if (count>min_dist){
      valback<-vec_keep[1]
#      print(paste0("value to be returned:",  valback))
      ntimes_valback<-(table(vec_keep)[valback])
      vec_use<- c(vec_use,rep(valback,ntimes_valback))
#      print(paste0("vec_use after giving back valbak =", valback))
#      print(paste0(vec_use,","))
      vec_keep <- vec_keep[!vec_keep %in% valback]
#      print(paste0("vec_keep after removing valback =", valback))
#      print(paste0(vec_keep,","))
    }
    val<-sample(vec_use,1)
#    print(paste0("val = ",val))#remove value
    vec_keep<- c(vec_keep,x[x %in% val])
    vec_keep<-vec_keep[1:(length(vec_keep)-1)]#removing 1 letter
#    print(paste0("vec_keep ="))
#    print(paste0(vec_keep,","))
    vec_use <- vec_use[!vec_use %in% val]
#    print(paste0("vec_use ="))
#    print(paste0(vec_use,","))
    result[i]<-val
    count<-count+1
    }
  return(result)
}
shuffle(x)
"C" "D" "B" "E" "C" "A" "B" "D" "E" "A" "C" "D" "B" "E" "C"
like image 3
Carles S Avatar answered Oct 23 '22 07:10

Carles S