Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Better way to shuffle elements of a string in R

Tags:

string

r

shuffle

I have to shuffle elements of a string. I wrote a code:

sequ <- "GCTTCG"
set.seed(2017)
i <- sample(1:nchar(sequ))
separate.seq.letters <- unlist(strsplit(sequ, ""))
paste(separate.seq.letters[i], collapse = "")
[1] "GTCGTC"

This code shuffles elements one time. The main question would be is there a better (more effective) way to do that? For very long sequences and huge amount of shuffles strsplit, paste commands takes some extra time.

like image 565
neringab Avatar asked Nov 03 '25 23:11

neringab


2 Answers

Making use of the Rcpp package to handle in C is probably fastest.

Below I've done some benchmarking of a handful of approaches suggested so far, including:

  • Approach in the QUESTION
  • Approach in COMMENT by @akrun
  • Approach using BIOSTRINGS package, suggested by @knb
  • Approach using the STRINGI package, suggested by @Rich
  • A custom RCPP function, based on this post.

Except for the stringi function, here are the others wrapped into functions for testing:

f_question <- function(s) {
  i <- sample(1:nchar(s))
  separate.seq.letters <- unlist(strsplit(s, ""))
  paste(separate.seq.letters[i], collapse = "")
}

f_comment <- function(s) {
  s1 <- unlist(strsplit(s, ""))
  paste(s1[sample(nchar(s))], collapse="")
}

library(Biostrings)
f_biostring <- function(s) {
  probes <- DNAStringSet(s)
  lapply(probes, sample)
}

Rcpp::cppFunction(
  'std::string shuffleString(std::string s) {
    int x = s.length();
    for (int y = x; y > 0; y--) { 
      int pos = rand()%x;
      char tmp = s[y-1];
      s[y-1] = s[pos];
      s[pos] = tmp;
    }
    return s;
  }'
)

For testing, load libraries and write function to generate sequences of length n:

library(microbenchmark)
library(tidyr)
library(ggplot2)

generate_string <- function(n) {
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}

sequ <- generate_string(10)

# Test example....

sequ
#> [1] "TTATCAAGGC"

f_question(sequ)
#> [1] "CATGGTACAT"
f_comment(sequ)
#> [1] "GATTATAGCC"
f_biostring(sequ)
#> [[1]]
#>   10-letter "DNAString" instance
#> seq: TAGATCGCAT
shuffleString(sequ)
#> [1] "GATTAATCGC"
stringi::stri_rand_shuffle(sequ)
#> [1] "GAAGTCCTTA"

Testing all functions with small n (10 - 100):

ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
  string <- generate_string(n)

  op <- microbenchmark(
    QUESTION     = f_question(string),
    COMMENT      = f_comment(string),
    BIOSTRING    = f_biostring(string),
    RCPP         = shuffleString(string),
    STRINGI      = stringi::stri_rand_shuffle(string)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))

times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()

enter image description here

Biostrings approach is pretty slow.

Dropping this and moving up to 100 - 1000 (code stays same except ns):

enter image description here

The R-based functions (from the question and comment) are comparable, but falling behind.

Dropping these and moving up to 1000 - 10000:

enter image description here

Looks like the custom Rcpp function is the winner, particularly as the string length grows. However, if choosing between these, consider that the stringi function, stri_rand_shuffle, will be more robust (e.g., better tested and designed to handle corner cases).

like image 178
Simon Jackson Avatar answered Nov 05 '25 15:11

Simon Jackson


You could have a look at stri_rand_shuffle(), from the stringi package. It is written entirely in C and should be very efficient. According to the documentation, it

Generates a (pseudo)random permutation of code points in each string.

Let's try it out:

replicate(5, stringi::stri_rand_shuffle("GCTTCG"))
# [1] "GTTCCG" "CCGTTG" "CTCTGG" "CCGGTT" "GTCGCT"
like image 39
Rich Scriven Avatar answered Nov 05 '25 14:11

Rich Scriven