Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to efficiently sort the characters in a string in R?

How can I efficiently sort the characters of each string in a vector? For example, given a vector of strings:

set.seed(1)
strings <- c(do.call(paste0, replicate(4, sample(LETTERS, 10000, TRUE), FALSE)),
do.call(paste0, replicate(3, sample(LETTERS, 10000, TRUE), FALSE)),
do.call(paste0, replicate(2, sample(LETTERS, 10000, TRUE), FALSE)))

I have written a function that will split each string into a vector, sort the vector, and then collapse the output:

sort_cat <- function(strings){
  tmp <- strsplit(strings, split="")
  tmp <- lapply(tmp, sort)
  tmp <- lapply(tmp, paste0, collapse = "")
  tmp <- unlist(tmp)
  return(tmp)
}
sorted_strings <- sort_cat(strings)

However, the vector of strings I need to apply this to is very long, and this function is too slow. Does anyone have any suggestions for how to improve performance?

like image 310
Powege Avatar asked Oct 08 '19 13:10

Powege


People also ask

How do I sort a specific order in R?

order() in R The numbers are ordered according to its index by using order(x) . Here the order() will sort the given numbers according to its index in the ascending order.

How do you select the first three characters in R?

To get the first n characters from a string, we can use the built-in substr() function in R. The substr() function takes 3 arguments, the first one is a string, the second is start position, third is end position. Note: The negative values count backward from the last character.

How do you sort a string by function?

For example, to sort the string in descending order, use the std::greater<>() function object, which delegates the call to operator> . Since C++14, we can skip the template arguments to std::greater function object. i.e., std::greater<>() will work. That's all about sorting the characters of a string in C++.


1 Answers

Re-implementing using stringi gives a roughly 4x speedup. I also edited sort_cat to use fixed = TRUE in the strsplit, which makes it a little faster. And thanks to Carl for the single loop suggestion, which speeds us up just a little bit more.

sort_cat <- function(strings){
  tmp <- strsplit(strings, split="", fixed = TRUE)
  tmp <- lapply(tmp, sort)
  tmp <- lapply(tmp, paste0, collapse = "")
  tmp <- unlist(tmp)
  return(tmp)
}

library(stringi)
sort_stringi = function(s) {
  s = stri_split_boundaries(s, type = "character")
  s = lapply(s, stri_sort)
  s = lapply(s, stri_join, collapse = "")
  unlist(s)
}

sort_stringi_loop = function(s) {
  s = stri_split_boundaries(s, type = "character")
  for (i in seq_along(s)) {
    s[[i]] = stri_join(stri_sort(s[[i]]), collapse = "")
  }
  unlist(s)
}

bench::mark(
  sort_cat(strings),
  sort_stringi(strings),
  sort_stringi_loop(strings)
)
# # A tibble: 3 x 13
#   expression                    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>                 <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 sort_cat(strings)          23.01s 23.01s    0.0435    31.2MB     2.17     1    50     23.01s <chr ~ <Rpro~
# 2 sort_stringi(strings)       6.16s  6.16s    0.162     30.5MB     2.11     1    13      6.16s <chr ~ <Rpro~
# 3 sort_stringi_loop(strings)  5.75s  5.75s    0.174     15.3MB     1.74     1    10      5.75s <chr ~ <Rpro~
# # ... with 2 more variables: time <list>, gc <list>

This method could also be used in parallel. Profiling the code to see which operations actually take the longest would be a good next step if you want to go even faster.

like image 87
Gregor Thomas Avatar answered Oct 08 '22 01:10

Gregor Thomas