Base R grep-family is much faster than `stringr` variants when dealing with factors

Question

I have been using stringr because it's supposed to be faster, but I found out today that it's much slower when dealing with factors. I haven't seen any warning that this would be the case, nor any explanation of why.

For example:

library(stringr)  # str_detect()
library(forcats)  # as_factor()

string_options = c("OneWord", "TwoWords", "ThreeWords")

sample_chars = sample(string_options, 1e6, replace = TRUE)
sample_facts = as_factor(sample_chars)

When working with character vectors, base R is slower than stringr, as expected. But when dealing with factors, base R is roughly 25x faster by median time.

bench::mark(
    base_chars = grepl("Two", sample_chars),
    stringr_chars = str_detect(sample_chars, "Two"),
    base_facts = grepl("Two", sample_facts),
    stringr_facts = str_detect(sample_facts, "Two")
)

# A tibble: 4 × 13
#  expression         min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result            memory             time             gc      
#  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>            <list>             <list>           <list>  
#1 base_chars     116.1ms 116.38ms      8.58    3.81MB        0     5     0      583ms <lgl [1,000,000]> <Rprofmem [1 × 3]> <bench_tm [5]>   <tibble>
#2 stringr_chars  86.04ms   88.2ms     11.3     3.81MB        0     6     0      532ms <lgl [1,000,000]> <Rprofmem [2 × 3]> <bench_tm [6]>   <tibble>
#3 base_facts      3.59ms   3.65ms    271.     11.44MB        0   136     0      501ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [136]> <tibble>
#4 stringr_facts  90.71ms  91.29ms     10.9    11.44MB        0     6     0      549ms <lgl [1,000,000]> <Rprofmem [3 × 3]> <bench_tm [6]>   <tibble>

It looks like stringr isn't doing anything different with factor terms but base R is significantly optimizing it. Is this expected behaviour? Should I report this as a stringr issue? Is there some stringr setting I'm completely missing? I'd like to not have to think about the format of the data to determine if I'm using stringr or base R.

Andre Wildberg · Accepted Answer

Here is a comparison that also takes stringi into consideration, since stringr is mostly a wrapper around it. It basically confirms SamR's and MrFlick's assumptions.

I modified sample_chars and sample_facts to contain only unique values, which makes the case a little more realistic.

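# make.unique() appends ".1", ".2", ... to duplicates, so every element is distinct
sample_chars <- make.unique(sample_chars)
sample_facts <- as_factor(sample_chars)  # as_factor() is from forcats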
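To make the effect of this change concrete, here is a small sketch of what make.unique() does to the number of factor levels. It uses a shorter vector and base factor() instead of as_factor() so that it runs without extra packages; the variable names are illustrative only. The relevance to the timings rests on the assumption that any factor speed-up comes from searching the levels rather than every element.

```r
string_options <- c("OneWord", "TwoWords", "ThreeWords")
x <- sample(string_options, 1000, replace = TRUE)

# Original setup: many elements, but at most three distinct levels
nlevels(factor(x))

# After make.unique(): duplicates get ".1", ".2", ... appended,
# so there is one level per element
x_unique <- make.unique(x)
nlevels(factor(x_unique))
head(x_unique)
```

With one level per element, there is nothing left to save by working on the levels.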

Unit: milliseconds
                    expr      min       lq     mean   median       uq       max
              base_chars 649.0219 661.5219 722.4035 693.5823 735.9432 1176.7773
          base_chars_fxd 114.8511 116.5084 125.4108 119.4629 125.0511  201.5845
              base_facts 694.8045 711.7315 769.4673 747.6702 800.0615 1081.8429
          base_facts_fxd 160.9806 165.8422 190.6949 173.8654 214.0473  289.7798
           stringr_chars 411.7234 420.1062 460.1629 437.6956 464.2701  872.4820
           stringr_facts 457.8932 468.9451 510.3969 492.0644 533.0966  711.8716
 stringi_detect_regex_ch 422.2608 432.0969 469.1013 449.2074 475.9340  756.1704
 stringi_detect_regex_fa 468.1434 480.4488 522.0438 504.2079 546.7441  750.5353
   stringi_detect_fxd_ch 118.0593 126.1093 138.3083 133.4033 142.0186  206.6351
   stringi_detect_fxd_fa 163.8008 172.3385 197.7767 181.6919 215.5328  316.5359

[Figure: benchmark of character detection methods]

library(stringr)
library(stringi)
library(microbenchmark)

microbenchmark(
    base_chars = {grepl("Two", sample_chars)},
    base_chars_fxd = {grepl("Two", sample_chars, fixed = TRUE)},
    base_facts = {grepl("Two", sample_facts)},
    base_facts_fxd = {grepl("Two", sample_facts, fixed = TRUE)},
    stringr_chars = {str_detect(sample_chars, "Two")},
    stringr_facts = {str_detect(sample_facts, "Two")},
    stringi_detect_regex_ch = {stri_detect_regex(sample_chars, "Two")},
    stringi_detect_regex_fa = {stri_detect_regex(sample_facts, "Two")},
    stringi_detect_fxd_ch = {stri_detect_fixed(sample_chars, "Two")},
    stringi_detect_fxd_fa = {stri_detect_fixed(sample_facts, "Two")}
)
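The two sets of timings are consistent with base R matching the pattern against the factor levels and then expanding the result to the full vector, a shortcut that pays off with few levels and stops helping once every value is unique. If that is what happens (an assumption here, not something the benchmark proves), the same idea can be sketched on top of stringr. Note that detect_on_levels() is a hypothetical helper name, not a stringr function.

```r
library(stringr)

# Hypothetical helper (not part of stringr): run the regex once per level,
# then expand the per-level result to the full vector via the integer codes.
detect_on_levels <- function(x, pattern) {
  if (!is.factor(x)) {
    return(str_detect(x, pattern))
  }
  hit_by_level <- str_detect(levels(x), pattern)
  # as.integer(x) gives the level index of each element; NA stays NA,
  # matching what str_detect() returns for missing strings.
  hit_by_level[as.integer(x)]
}

string_options <- c("OneWord", "TwoWords", "ThreeWords")
sample_facts <- factor(sample(string_options, 1e5, replace = TRUE))

# Same result as the direct call, but only three strings are searched
stopifnot(identical(
  detect_on_levels(sample_facts, "Two"),
  str_detect(sample_facts, "Two")
))
```

Whether this helps depends on the ratio of levels to elements. With all-unique values, as in the modified benchmark, it would be expected to add overhead rather than save time.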
