
When does setting 'perl=TRUE' in 'strsplit' not work (as intended or at all)?

Tags: regex, r, pcre

I just did some benchmarking while trying to optimise some code and observed that strsplit with perl=TRUE is faster than running strsplit with perl=FALSE. For example,

set.seed(1)
ff <- function() paste(sample(10), collapse= " ")
xx <- replicate(1e5, ff())

system.time(t1 <- strsplit(xx, "[ ]"))
#  user  system elapsed 
# 1.246   0.002   1.268 

system.time(t2 <- strsplit(xx, "[ ]", perl=TRUE))
#  user  system elapsed 
# 0.389   0.001   0.392 

identical(t1, t2) 
# [1] TRUE

So my question (or rather a variation of the question in the title) is: under what circumstances would we absolutely need perl=FALSE (leaving aside the fixed and useBytes parameters)? In other words, what can be done with perl=FALSE that cannot be done with perl=TRUE?

Arun asked Jul 20 '13 00:07




1 Answer

From the documentation ;)

Performance considerations

If you are doing a lot of regular expression matching, including on very long strings, you will want to consider the options used. Generally PCRE will be faster than the default regular expression engine, and fixed = TRUE faster still (especially when each pattern is matched only a few times).
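The quoted passage can be checked on the data from the question. What follows is only a minimal sketch under stated assumptions: it uses a smaller sample (1e4 strings rather than 1e5) and a literal single-space separator, so that fixed = TRUE is applicable alongside the two regex engines. The names r_default, r_perl, r_fixed and elapsed are illustrative. Timings depend on the machine and the R version, so no expected numbers are shown.

```r
# Same test data as in the question, with a smaller sample (assumption).
set.seed(1)
ff <- function() paste(sample(10), collapse = " ")
xx <- replicate(1e4, ff())

# Time the three matching modes on the same literal separator.
t_default <- system.time(r_default <- strsplit(xx, " "))               # default regex engine
t_perl    <- system.time(r_perl    <- strsplit(xx, " ", perl = TRUE))  # PCRE
t_fixed   <- system.time(r_fixed   <- strsplit(xx, " ", fixed = TRUE)) # literal matching

# For this simple pattern all three modes should return the same split.
stopifnot(identical(r_default, r_perl), identical(r_default, r_fixed))

# Collect elapsed times side by side for comparison.
elapsed <- c(default = t_default[["elapsed"]],
             perl    = t_perl[["elapsed"]],
             fixed   = t_fixed[["elapsed"]])
print(elapsed)
```

Note that identical results here only show agreement for this one pattern; they say nothing about patterns whose syntax the two regex engines interpret differently.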

Of course, this does not answer the question "are there any dangers to always using perl=TRUE?"

Ricardo Saporta answered Sep 21 '22 06:09