I have been using stringr because it is supposed to be faster, but I found out today that it is much slower when dealing with factors. I did not see any warning that this would be the case, nor an explanation of why.
For example:
library(stringr)
library(forcats)  # provides as_factor()
string_options = c("OneWord", "TwoWords", "ThreeWords")
sample_chars = sample(string_options, 1e6, replace = TRUE)
sample_facts = as_factor(sample_chars)
When working with character vectors, base R is slower than stringr, as expected. But when dealing with factors, base R is roughly 25x faster than stringr (median 3.65 ms vs. 91.29 ms in the results below).
bench::mark(
base_chars = grepl("Two", sample_chars),
stringr_chars = str_detect(sample_chars, "Two"),
base_facts = grepl("Two", sample_facts),
stringr_facts = str_detect(sample_facts, "Two")
)
# A tibble: 4 × 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#1 base_chars 116.1ms 116.38ms 8.58 3.81MB 0 5 0 583ms <lgl [1,000,000]> <Rprofmem [1 × 3]> <bench_tm [5]> <tibble>
#2 stringr_chars 86.04ms 88.2ms 11.3 3.81MB 0 6 0 532ms <lgl [1,000,000]> <Rprofmem [2 × 3]> <bench_tm [6]> <tibble>
#3 base_facts 3.59ms 3.65ms 271. 11.44MB 0 136 0 501ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [136]> <tibble>
#4 stringr_facts 90.71ms 91.29ms 10.9 11.44MB 0 6 0 549ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [6]> <tibble>
It looks like stringr isn't doing anything different with factor terms but base R is significantly optimizing it. Is this expected behaviour? Should I report this as a stringr issue? Is there some stringr setting I'm completely missing? I'd like to not have to think about the format of the data to determine if I'm using stringr or base R.
Here is a comparison that also takes stringi into consideration, since stringr is mostly a wrapper around it. It basically confirms SamR's and MrFlick's assumptions.
I modified sample_chars and sample_facts to contain only unique values, which makes the case a little more realistic.
sample_chars <- make.unique(sample_chars)
sample_facts <- as_factor(sample_chars)
Unit: milliseconds
expr min lq mean median uq max
base_chars 649.0219 661.5219 722.4035 693.5823 735.9432 1176.7773
base_chars_fxd 114.8511 116.5084 125.4108 119.4629 125.0511 201.5845
base_facts 694.8045 711.7315 769.4673 747.6702 800.0615 1081.8429
base_facts_fxd 160.9806 165.8422 190.6949 173.8654 214.0473 289.7798
stringr_chars 411.7234 420.1062 460.1629 437.6956 464.2701 872.4820
stringr_facts 457.8932 468.9451 510.3969 492.0644 533.0966 711.8716
stringi_detect_regex_ch 422.2608 432.0969 469.1013 449.2074 475.9340 756.1704
stringi_detect_regex_fa 468.1434 480.4488 522.0438 504.2079 546.7441 750.5353
stringi_detect_fxd_ch 118.0593 126.1093 138.3083 133.4033 142.0186 206.6351
stringi_detect_fxd_fa 163.8008 172.3385 197.7767 181.6919 215.5328 316.5359
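One way to read the factor rows: base grepl() appears to take its shortcut only when a factor has fewer levels than elements (this is my reading of the grepl source, so treat it as an assumption), and make.unique() removes exactly that condition. A small base-R sketch at a reduced size shows the level count before and after; the variable names are illustrative only.

```r
# Illustrative sketch (smaller n than the benchmark, base R only).
string_options <- c("OneWord", "TwoWords", "ThreeWords")
chars_repeated <- sample(string_options, 1e4, replace = TRUE)
chars_unique <- make.unique(chars_repeated)

# At most three levels before make.unique(), one level per element after.
levels_repeated <- nlevels(factor(chars_repeated))
levels_unique <- nlevels(factor(chars_unique))

print(c(repeated = levels_repeated, unique = levels_unique))
```

With as many levels as elements, matching per level cannot save any work, which is consistent with the *_facts rows being no faster than the *_chars rows above.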

library(stringr)
library(stringi)
library(forcats)  # provides as_factor()
library(microbenchmark)
microbenchmark(
base_chars = {grepl("Two", sample_chars)},
base_chars_fxd = {grepl("Two", sample_chars, fixed = TRUE)},
base_facts = {grepl("Two", sample_facts)},
base_facts_fxd = {grepl("Two", sample_facts, fixed = TRUE)},
stringr_chars = {str_detect(sample_chars, "Two")},
stringr_facts = {str_detect(sample_facts, "Two")},
stringi_detect_regex_ch = {stri_detect_regex(sample_chars, "Two")},
stringi_detect_regex_fa = {stri_detect_regex(sample_facts, "Two")},
stringi_detect_fxd_ch = {stri_detect_fixed(sample_chars, "Two")},
stringi_detect_fxd_fa = {stri_detect_fixed(sample_facts, "Two")}
)
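If you want the level-based speedup for factors that do have repeated values, you can do by hand what base R seems to do: match once per level and expand the result through the integer codes. The following is a minimal sketch, not something stringr provides; detect_by_levels() and its detect argument are hypothetical names, and the default detector uses base grepl() so the block runs without extra packages.

```r
# Hypothetical helper: run a detector once per factor level, then expand
# the per-level result back to full length via the factor's integer codes.
detect_by_levels <- function(x, pattern,
                             detect = function(s, p) grepl(p, s, fixed = TRUE)) {
  if (!is.factor(x)) {
    # Nothing to exploit for non-factors: match every element.
    return(detect(as.character(x), pattern))
  }
  per_level <- detect(levels(x), pattern)  # one result per distinct level
  per_level[as.integer(x)]                 # NA codes index to NA
}

set.seed(1)
x_chr <- sample(c("OneWord", "TwoWords", "ThreeWords"), 1000, replace = TRUE)
x_fct <- factor(x_chr)

res <- detect_by_levels(x_fct, "Two")
print(table(res))
```

Passing detect = stringr::str_detect should apply the same idea to stringr, since str_detect() takes the string first and the pattern second. Note that missing values come back as NA here, whereas grepl() returns FALSE for them.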