I just did some benchmarking while trying to optimise some code and observed that strsplit with perl=TRUE is faster than strsplit with perl=FALSE. For example:
set.seed(1)
ff <- function() paste(sample(10), collapse = " ")
xx <- replicate(1e5, ff())
system.time(t1 <- strsplit(xx, "[ ]"))
# user system elapsed
# 1.246 0.002 1.268
system.time(t2 <- strsplit(xx, "[ ]", perl=TRUE))
# user system elapsed
# 0.389 0.001 0.392
identical(t1, t2)
# [1] TRUE
So my question (or rather a variation of the question in the title) is: under what circumstances would we absolutely need perl=FALSE (leaving aside the fixed and useBytes arguments)? In other words, what can't we do using perl=TRUE that can be done by setting perl=FALSE?
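For what it's worth, the asymmetry mostly seems to run the other way: PCRE accepts constructs such as lookarounds that the default (TRE) engine rejects. A small sketch, with the outputs I would expect shown as comments (not re-verified here):

strsplit("a1b2c3", "(?<=[0-9])", perl=TRUE)   # zero-width split after each digit
# [[1]]
# [1] "a1" "b2" "c3"
strsplit("a1b2c3", "(?<=[0-9])")              # lookbehind is not valid ERE
# Error: invalid regular expression '(?<=[0-9])'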
From the documentation ;)
Performance considerations
If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
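The benchmark from the question can illustrate the fixed = TRUE point as well. Since the pattern "[ ]" matches nothing but a single literal space, splitting on the literal string should give identical results (a sketch reusing xx and t1 from the question; I have not timed it here, but per the documentation it should be faster still):

system.time(t3 <- strsplit(xx, " ", fixed=TRUE))  # literal split, no regex engine
identical(t1, t3)
# [1] TRUE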
Of course, this does not answer the question of "are there any dangers to always using perl=TRUE?"