How to efficiently sort the characters in a string in R?

Tags:

How can I efficiently sort the characters of each string in a vector? For example, given a vector of strings:

set.seed(1)
strings <- c(do.call(paste0, replicate(4, sample(LETTERS, 10000, TRUE), FALSE)),
do.call(paste0, replicate(3, sample(LETTERS, 10000, TRUE), FALSE)),
do.call(paste0, replicate(2, sample(LETTERS, 10000, TRUE), FALSE)))

I have written a function that will split each string into a vector, sort the vector, and then collapse the output:

sort_cat <- function(strings){
  tmp <- strsplit(strings, split="")
  tmp <- lapply(tmp, sort)
  tmp <- lapply(tmp, paste0, collapse = "")
  tmp <- unlist(tmp)
  return(tmp)
}
sorted_strings <- sort_cat(strings)

However, the vector of strings I need to apply this to is very long, and this function is too slow. Does anyone have any suggestions for how to improve performance?

310

asked Oct 08 '19 13:10

Powege

1 Answers

Re-implementing using stringi gives a roughly 4x speedup. I also edited sort_cat to use fixed = TRUE in the strsplit, which makes it a little faster. And thanks to Carl for the single loop suggestion, which speeds us up just a little bit more.

sort_cat <- function(strings){
  tmp <- strsplit(strings, split="", fixed = TRUE)
  tmp <- lapply(tmp, sort)
  tmp <- lapply(tmp, paste0, collapse = "")
  tmp <- unlist(tmp)
  return(tmp)
}

library(stringi)
sort_stringi = function(s) {
  s = stri_split_boundaries(s, type = "character")
  s = lapply(s, stri_sort)
  s = lapply(s, stri_join, collapse = "")
  unlist(s)
}

sort_stringi_loop = function(s) {
  s = stri_split_boundaries(s, type = "character")
  for (i in seq_along(s)) {
    s[[i]] = stri_join(stri_sort(s[[i]]), collapse = "")
  }
  unlist(s)
}

bench::mark(
  sort_cat(strings),
  sort_stringi(strings),
  sort_stringi_loop(strings)
)
# # A tibble: 3 x 13
#   expression                    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory
#   <bch:expr>                 <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>
# 1 sort_cat(strings)          23.01s 23.01s    0.0435    31.2MB     2.17     1    50     23.01s <chr ~ <Rpro~
# 2 sort_stringi(strings)       6.16s  6.16s    0.162     30.5MB     2.11     1    13      6.16s <chr ~ <Rpro~
# 3 sort_stringi_loop(strings)  5.75s  5.75s    0.174     15.3MB     1.74     1    10      5.75s <chr ~ <Rpro~
# # ... with 2 more variables: time <list>, gc <list>

This method could also be used in parallel. Profiling the code to see which operations actually take the longest would be a good next step if you want to go even faster.

answered Oct 08 '22 01:10

Gregor Thomas

Related questions
                            
                                Calculate probability from density function
                            
                                3d scatter plot remains empty in R plotly
                            
                                How to "addTiles" on top of "addPolygons" in R's Leaflet?
                            
                                Evaluating R code in YAML header
                            
                                Include "All other functions" in a pkgdown reference yaml
                            
                                Using LASSO in R with categorical variables
                            
                                Simple way to visualise odds ratios in R
                            
                                Does dplyr::mutate() not recycle vectors?
                            
                                Properly License R Package that Includes Other MIT Code
                            
                                R equivalent of microbenchmark that includes memory as well as runtime
                            
                                Install keras and tensorflow using Rstudio
                            
                                Create n by n matrix with unique values from 1:n
                            
                                Package build ignores Makevars flags
                            
                                R: Finding the intersect of two lines
                            
                                Checkboxes in DT shiny
                            
                                updating Rgdal in R.3.5.1 C++11 dependency... although C++11 is available
                            
                                Using ggplot to plot line segments and points together
                            
                                Unexpected behaviour in ggplot2 pie chart labeling
                            
                                Creating geom / stat from scratch
                            
                                How to use saveRDS(..., refhook = ) parameter?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to efficiently sort the characters in a string in R?

Tags:

performance

string

sorting

r

Powege

People also ask

1 Answers

Gregor Thomas

Recent Activity

Donate For Us