How can I efficiently sort the characters of each string in a vector? For example, given a vector of strings:
set.seed(1)
strings <- c(do.call(paste0, replicate(4, sample(LETTERS, 10000, TRUE), FALSE)),
do.call(paste0, replicate(3, sample(LETTERS, 10000, TRUE), FALSE)),
do.call(paste0, replicate(2, sample(LETTERS, 10000, TRUE), FALSE)))
I have written a function that will split each string into a vector, sort the vector, and then collapse the output:
sort_cat <- function(strings){
tmp <- strsplit(strings, split="")
tmp <- lapply(tmp, sort)
tmp <- lapply(tmp, paste0, collapse = "")
tmp <- unlist(tmp)
return(tmp)
}
sorted_strings <- sort_cat(strings)
However, the vector of strings I need to apply this to is very long, and this function is too slow. Does anyone have any suggestions for how to improve performance?
order() in R The numbers are ordered according to its index by using order(x) . Here the order() will sort the given numbers according to its index in the ascending order.
To get the first n characters from a string, we can use the built-in substr() function in R. The substr() function takes 3 arguments, the first one is a string, the second is start position, third is end position. Note: The negative values count backward from the last character.
For example, to sort the string in descending order, use the std::greater<>() function object, which delegates the call to operator> . Since C++14, we can skip the template arguments to std::greater function object. i.e., std::greater<>() will work. That's all about sorting the characters of a string in C++.
Re-implementing using stringi
gives a roughly 4x speedup. I also edited sort_cat
to use fixed = TRUE
in the strsplit
, which makes it a little faster. And thanks to Carl for the single loop suggestion, which speeds us up just a little bit more.
sort_cat <- function(strings){
tmp <- strsplit(strings, split="", fixed = TRUE)
tmp <- lapply(tmp, sort)
tmp <- lapply(tmp, paste0, collapse = "")
tmp <- unlist(tmp)
return(tmp)
}
library(stringi)
sort_stringi = function(s) {
s = stri_split_boundaries(s, type = "character")
s = lapply(s, stri_sort)
s = lapply(s, stri_join, collapse = "")
unlist(s)
}
sort_stringi_loop = function(s) {
s = stri_split_boundaries(s, type = "character")
for (i in seq_along(s)) {
s[[i]] = stri_join(stri_sort(s[[i]]), collapse = "")
}
unlist(s)
}
bench::mark(
sort_cat(strings),
sort_stringi(strings),
sort_stringi_loop(strings)
)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory
# <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list>
# 1 sort_cat(strings) 23.01s 23.01s 0.0435 31.2MB 2.17 1 50 23.01s <chr ~ <Rpro~
# 2 sort_stringi(strings) 6.16s 6.16s 0.162 30.5MB 2.11 1 13 6.16s <chr ~ <Rpro~
# 3 sort_stringi_loop(strings) 5.75s 5.75s 0.174 15.3MB 1.74 1 10 5.75s <chr ~ <Rpro~
# # ... with 2 more variables: time <list>, gc <list>
This method could also be used in parallel. Profiling the code to see which operations actually take the longest would be a good next step if you want to go even faster.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With