When using base R string functions like gsub
and grep
, is there any downside to, as a matter of habit, always specifying perl = TRUE
?
With perl=TRUE
, the expressions can do more things (e.g, you can use look ahead or look behind assertions, or you can do case conversion with \\U
), and the performance is faster as well, as the documentation states.
So, are there any downsides? Is perl = TRUE
not the default just for backwards compatibility? Are there portability concerns I should be aware of when perl = TRUE
?
As of Perl 5.10, PCRE is also available as a replacement for Perl's default regular-expression engine through the re::engine::PCRE module. The library can be built on Unix, Windows, and several other environments.
PCRE tries to match Perl syntax and semantics as closely as it can. PCRE also supports some alternative regular expression syntax (which does not conflict with the Perl syntax) in order to provide some compatibility with regular expressions in Python, .
grep understands three different versions of regular expression syntax: basic (BRE), extended (ERE), and Perl-compatible (PCRE). In GNU grep , there is no difference in available functionality between the basic and extended syntaxes.
Regular Expression (Regex or Regexp or RE) in Perl is a special text string for describing a search pattern within a given text. Regex in Perl is linked to the host language and is not the same as in PHP, Python, etc. Sometimes it is termed as “Perl 5 Compatible Regular Expressions“.
It is not a good idea to compare apples to oranges, as PCRE regex can do much more than TRE regex enine. Although they share similar constructs, even then appearances might turn out deceitful.
TRE supports literals as PCRE. A literal is either an ordinary character, an 8-bit hex character (like \x1B
), a wide hex character (like \x{263a}
), or an escaped character: \a
, \e
, \f
, \n
, \r
, \t
. PCRE supports more: \cx
("control-x", where x
is any ASCII character), \0dd
(character with octal code 0dd
), \ddd
(character with octal code ddd
, or back reference), \o{ddd..}
(character with octal code ddd..
), \xhh
(character with hex code hh
), \x{hhh..}
(character with hex code hhh..
).
Both have a .
wildcard, but in TRE, it matches any char, in PCRE, it only matches any char but line break char(s) (and which ones depend on the newline convention PCRE verb, (*CR)
, (*LF)
, (*CRLF)
, (*ANYCRLF)
, (*ANY)
). gsub(".+", "~", "_\n_")
will result in ~
, but gsub(".+", "~", "_\n_", perl=TRUE)
will yield ~\n~
. And an opposite example, to make TRE .
act as in PCRE, use (?n)
modifier, gsub("(?n).+", "~", "_\n_")
to yield ~\n~
(with no way to choose among line ending styles). In PCRE patterns, to make .
match line breaks, you need to use (?s)
inline DOTALL modifier before .
(or (?s:.*)
like modifier groups).
Both support an alternation operator, but since TRE is a text-directed engine the longest alternative matches, and in PCRE, the leftmost alternative "wins". sub("(s|su)", "~", "sub")
yields ~b
(as su
is the longest matching alternative), but sub("(s|su)", "~", "sub", perl=TRUE)
produces ~ub
(since s
is first alternative to match).
Both support backreferences, but TRE only supports up to 9 backreferences. If you need 10 or more, use PCRE. sub("(.)\\1(.)\\2(.)\\3(.)\\4(.)\\5(.)\\6(.)\\7(.)\\8(.)\\9(.)\\10", "~", "112233445566778899aa", perl=TRUE)
will find a match, with no perl=TRUE
, no match will be detected.
Both seem to have character classes, [...]
like constructs, but in fact, in the POSIX world where TRE belongs to, these are called bracket expressions. While you may define literal char ranges in both, or specify literal chars with OR relationship between them, one can't use shorthand char classes in bracket expressions, nor any escape sequence. The [\d]+
pattern in a TRE regex is treated as 1 or more backslashes or/and d
letters, while in a PCRE pattern it will be parsed as 1+ digits (try gsub("[\\d]+", "~", "00\\99d")
(-> 00~99~
) and gsub("[\\d]+", "~", "00\\99d", perl=TRUE)
(-> ~\~d
)). This fact will explain why [\]\-\[]+
in a PCRE pattern matches 1+ ]
, -
or [
and does not in a TRE expression where you need to use "smart placing", like [][-]
.
TRE and PCRE support the \d
(digits), \D
(non-digits), \w
("word" chars), \W
("non-word" chars), \s
(any whitespace), \S
(any non-whitespace) shorthand character classes. However, PCRE also supports \v
(any vertical whitespace), \V
(any char other than a vertical whitespace), \h
(any horizontal whitespace), \H
(any character that is not a horizontal whitespace), \N
(any non-newline character), \X
(any Unicode grapheme, useful when processing letters with diacritics), \R
(any Unicode line break sequence).
Both flavors support quantifiers, regular, greedy ?
, *
, +
, lazy ??
, *?
, +?
, range/limiting quantifiers like greedy {3}
, {8,26}
or {3,}
and their lazy counterparts with ?
behind them. Note that TRE has poorer support for limiting quantifiers (it only supports values lower than 256 for {min}
quantifier, and throws "out of memory" exceptions for {2557,}
and bigger values. Make sure you always use the 0
value for the min value if it is what you imply, since {,2}
in TRE actually matches 3 occurrences. However, PCRE supports possessive quantifiers, ++
, ?+
, *+
, {1,5}+
. The patterns quantified with them disallow backtracking into them, once matched, the engine never retries them. Besides, like all other regex libraries based on Henry Spencer's regex library dated back to 1986 (Tcl, PostgreSQL), one should avoid mixing lazy and greedy quantifiers on the same level in the regex, because the first pattern sets the greediness of the whole pattern level and often leads to unexpected results.
Both flavors support POSIX character classes that can be used in between [
...]
. However, TRE supports [:alnum:]
(alphanumeric), [:alpha:]
(letters), [:blank:]
(horizontal whitespace), [:cntrl:]
(control chars), [:digit:]
(digits), [:graph:]
(visible chars, anything except spaces and control characters), [:lower:]
(lowercase letters), [:print:]
(all printable chars), [:punct:]
(symbols and punctuation), [:space:]
(any whitespace), [:upper:]
(uppercase letters) and [:xdigit:]
(chars in hex values). PCRE adds [:word:]
("word" chars) and [:ascii:]
(any ASCII char).
Both support word boundaries, but PCRE patterns do that in a more reliable way. Cf. gsub("\\b", "~", "CODE")
yielding ~C~O~D~E~
and gsub("\\b", "~", "CODE", perl=T)
producing ~CODE~
. Although TRE support specific leading \<
and trailing \>
word boundaries, PCRE \b
are still more reliable.
Both support inline modifiers that change certain pattern behavior when using them inside a pattern, e.g. (?i)
. TRE supports i
(case insensitive), n
(dot no longer matches newline), r
(causes the regex to be matched in a right associative manner rather than the normal left associative manner. By default, concatenation is left associative in TRE, as per the grammar given in the base specifications on regular expressions of Std 1003.1-2001 (POSIX). This flag flips associativity of concatenation to right associative. Associativity can have an effect on how a match is divided into submatches, but does not change what is matched by the entire regexp) and U
(swaps greediness, *?
becomes greedy and *
becomes lazy). PCRE supports i
and U
modifiers, and more: m
(^
and $
match start/end of the line, not the whole string), s
(dot matches newline), x
(allows using whitespace to format the pattern and use comments), J
(allows using names capturing groups with the same name), X
(makes escaping letters with a backslash an error if that combination is not a valid regex token), D
(makes $
only match the very end of string, else, it also matches a position before the final trailing newline in the string) and A
(only match at the start of string, as if there was \A
or ^
in front).
See TRE docs: Matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. That leads to issues with patterns like (.{2,})\1+
to search for duplicate consecutive substrings. See Remove repeated elements in a string with R.
So, when you need to rely on backtracking much, choose PCRE.
The most visible shortcoming of TRE is that it does not support lookarounds. However, there are a lot of things that PCRE can boast of:
(*SKIP)(*FAIL)
PCRE verb combination to match and skip patterns while matching\G
anchor that matches start of string or the end of the previous successful match(?|...|...)
branch reset group allowing capturing groups inside it to share the same IDs\p{...}
and opposite \P{...}
Unicode character properties\L
) or upper case (\U
) (up to the \E
or end of match if it is missing) (actually, it is an extension of the PCRE library used in R)\K
match reset operator (\K
reference)There are more things, like anchors (\A
(start of string), \Z
(end of string), \z
(very end of string)), conditional "if-then-else" construct, atomic groupings (working in the same way as possessive quantifiers, but disallowing backtracking into whole sequences of patterns), etc.
If we want to compare the performance of the TRE and PCRE regex engines in R, we should use simple patterns that match literally the same texts with these 2 engines.
I use R in Windows mostly, but I installed R 3.2.3 on a Linux VM specifically for this testing. The results for MacOS are borrowed from the t.kalinowski's answer.
Let's compare TRE (default) and PCRE (perl=TRUE
) regex performance using microbenchmark library (see more benchmarking options in R):
library(microbenchmark)
The text is a Wikipedia article about butterflies.
txt <- "Butterflies are insects in the macrolepidopteran clade Rhopalocera from the order Lepidoptera, which also includes moths. Adult butterflies have large, often brightly coloured wings, and conspicuous, fluttering flight. The group comprises the large superfamily Papilionoidea, which contains at least one former group, the skippers (formerly the superfamily \"Hesperioidea\") and the most recent analyses suggest it also contains the moth-butterflies (formerly the superfamily \"Hedyloidea\"). Butterfly fossils date to the Paleocene, which was about 56 million years ago."
Let's try and extract the last text inside parentheses with sub
, a very common sub
operation in R:
# sub('.*\\((.*)\\).*', '\\1', txt) # => [1] "formerly the superfamily \"Hedyloidea\"" PCRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', txt, perl=TRUE)) } TRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', txt)) } test <- microbenchmark( PCRE_1(txt), TRE_1(txt), times = 500000 ) test
The results are the following:
WINDOWS ------- Unit: microseconds expr min lq mean median uq max neval PCRE_1(txt) 163.607 165.418 168.65393 166.625 167.229 7314.588 5e+05 TRE_1(txt) 70.031 72.446 74.53842 73.050 74.257 38026.680 5e+05 MacOS ----- Unit: microseconds expr min lq mean median uq max neval PCRE_1(txt) 31.693 32.857 37.00757 33.413 35.805 43810.177 5e+05 TRE_1(txt) 46.037 47.199 53.06407 47.807 51.981 7702.869 5e+05 Linux ------ Unit: microseconds expr min lq mean median uq max neval PCRE_1(txt) 10.557 11.555 13.78216 12.097 12.662 4301.178 5e+05 TRE_1(txt) 25.875 27.350 31.51925 27.805 28.737 17974.716 5e+05
TRE regex sub
wins only in Windows, more than 2 times as fast. On both MacOS and Linux, PCRE (perl=TRUE
) version wins with a similar ratio.
Now, let's compare the performance of regexps that don't use backtracking that heavily and extract the words inside double quotes:
# regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt)) # => [1] "\"Hesperioidea\"" "\"Hedyloidea\"" PCRE_2 <- function(text) { return(regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt, perl=TRUE))) } TRE_2 <- function(text) { return(regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt))) } test <- microbenchmark( PCRE_2(txt), TRE_2(txt), times = 500000 ) test WINDOWS ------- Unit: microseconds expr min lq mean median uq max neval PCRE_2(txt) 324.799 330.232 349.0281 332.646 336.269 124404.14 5e+05 TRE_2(txt) 187.755 191.981 204.7663 193.792 196.208 74554.94 5e+05 MacOS ----- Unit: microseconds expr min lq mean median uq max neval PCRE_2(txt) 63.801 68.115 75.51773 69.164 71.219 47686.40 5e+05 TRE_2(txt) 63.825 67.849 75.20246 68.883 70.933 49691.92 5e+05 LINUX ----- Unit: microseconds expr min lq mean median uq max neval PCRE_2(txt) 30.199 34.750 44.05169 36.151 43.403 38428.2 5e+05 TRE_2(txt) 37.752 41.854 52.58230 43.409 51.781 38915.7 5e+05
The best mean value belongs to the PCRE regex in Linux, in MacOS, the difference is almost negligent, and in Windows, TRE works much faster.
It is clear that TRE (default) regex library works much faster in Windows. In Linux, PCRE regex is considerably faster. In MacOS, PCRE regex is still preferable since, with backtracking patterns, PCRE regex is faster than TRE in that OS.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With