Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

regular expressions in base R: 'perl=TRUE' vs. the default (PCRE vs. TRE)

Tags:

When using base R string functions like gsub and grep, is there any downside to, as a matter of habit, always specifying perl = TRUE?

With perl=TRUE, the expressions can do more things (e.g, you can use look ahead or look behind assertions, or you can do case conversion with \\U), and the performance is faster as well, as the documentation states.

So, are there any downsides? Is perl = TRUE not the default just for backwards compatibility? Are there portability concerns I should be aware of when perl = TRUE?

like image 891
t-kalinowski Avatar asked Nov 11 '17 16:11

t-kalinowski


People also ask

Does Perl use Pcre?

As of Perl 5.10, PCRE is also available as a replacement for Perl's default regular-expression engine through the re::engine::PCRE module. The library can be built on Unix, Windows, and several other environments.

What does Pcre matching do?

PCRE tries to match Perl syntax and semantics as closely as it can. PCRE also supports some alternative regular expression syntax (which does not conflict with the Perl syntax) in order to provide some compatibility with regular expressions in Python, .

Is grep a Pcre?

grep understands three different versions of regular expression syntax: basic (BRE), extended (ERE), and Perl-compatible (PCRE). In GNU grep , there is no difference in available functionality between the basic and extended syntaxes.

What is in Perl regex?

Regular Expression (Regex or Regexp or RE) in Perl is a special text string for describing a search pattern within a given text. Regex in Perl is linked to the host language and is not the same as in PHP, Python, etc. Sometimes it is termed as “Perl 5 Compatible Regular Expressions“.


1 Answers

It is not a good idea to compare apples to oranges, as PCRE regex can do much more than TRE regex enine. Although they share similar constructs, even then appearances might turn out deceitful.

What is similar between TRE and PCRE

TRE supports literals as PCRE. A literal is either an ordinary character, an 8-bit hex character (like \x1B), a wide hex character (like \x{263a}), or an escaped character: \a, \e, \f, \n, \r, \t. PCRE supports more: \cx ("control-x", where x is any ASCII character), \0dd (character with octal code 0dd), \ddd (character with octal code ddd, or back reference), \o{ddd..} (character with octal code ddd..), \xhh (character with hex code hh), \x{hhh..} (character with hex code hhh..).

Both have a . wildcard, but in TRE, it matches any char, in PCRE, it only matches any char but line break char(s) (and which ones depend on the newline convention PCRE verb, (*CR), (*LF), (*CRLF), (*ANYCRLF), (*ANY)). gsub(".+", "~", "_\n_") will result in ~, but gsub(".+", "~", "_\n_", perl=TRUE) will yield ~\n~. And an opposite example, to make TRE . act as in PCRE, use (?n) modifier, gsub("(?n).+", "~", "_\n_") to yield ~\n~ (with no way to choose among line ending styles). In PCRE patterns, to make . match line breaks, you need to use (?s) inline DOTALL modifier before . (or (?s:.*) like modifier groups).

Both support an alternation operator, but since TRE is a text-directed engine the longest alternative matches, and in PCRE, the leftmost alternative "wins". sub("(s|su)", "~", "sub") yields ~b (as su is the longest matching alternative), but sub("(s|su)", "~", "sub", perl=TRUE) produces ~ub (since s is first alternative to match).

Both support backreferences, but TRE only supports up to 9 backreferences. If you need 10 or more, use PCRE. sub("(.)\\1(.)\\2(.)\\3(.)\\4(.)\\5(.)\\6(.)\\7(.)\\8(.)\\9(.)\\10", "~", "112233445566778899aa", perl=TRUE) will find a match, with no perl=TRUE, no match will be detected.

Both seem to have character classes, [...] like constructs, but in fact, in the POSIX world where TRE belongs to, these are called bracket expressions. While you may define literal char ranges in both, or specify literal chars with OR relationship between them, one can't use shorthand char classes in bracket expressions, nor any escape sequence. The [\d]+ pattern in a TRE regex is treated as 1 or more backslashes or/and d letters, while in a PCRE pattern it will be parsed as 1+ digits (try gsub("[\\d]+", "~", "00\\99d") (-> 00~99~) and gsub("[\\d]+", "~", "00\\99d", perl=TRUE) (-> ~\~d)). This fact will explain why [\]\-\[]+ in a PCRE pattern matches 1+ ], - or [ and does not in a TRE expression where you need to use "smart placing", like [][-].

TRE and PCRE support the \d (digits), \D (non-digits), \w ("word" chars), \W ("non-word" chars), \s (any whitespace), \S (any non-whitespace) shorthand character classes. However, PCRE also supports \v (any vertical whitespace), \V (any char other than a vertical whitespace), \h (any horizontal whitespace), \H (any character that is not a horizontal whitespace), \N (any non-newline character), \X (any Unicode grapheme, useful when processing letters with diacritics), \R (any Unicode line break sequence).

Both flavors support quantifiers, regular, greedy ?, *, +, lazy ??, *?, +?, range/limiting quantifiers like greedy {3}, {8,26} or {3,} and their lazy counterparts with ? behind them. Note that TRE has poorer support for limiting quantifiers (it only supports values lower than 256 for {min} quantifier, and throws "out of memory" exceptions for {2557,} and bigger values. Make sure you always use the 0 value for the min value if it is what you imply, since {,2} in TRE actually matches 3 occurrences. However, PCRE supports possessive quantifiers, ++, ?+, *+, {1,5}+. The patterns quantified with them disallow backtracking into them, once matched, the engine never retries them. Besides, like all other regex libraries based on Henry Spencer's regex library dated back to 1986 (Tcl, PostgreSQL), one should avoid mixing lazy and greedy quantifiers on the same level in the regex, because the first pattern sets the greediness of the whole pattern level and often leads to unexpected results.

Both flavors support POSIX character classes that can be used in between [...]. However, TRE supports [:alnum:] (alphanumeric), [:alpha:] (letters), [:blank:] (horizontal whitespace), [:cntrl:] (control chars), [:digit:] (digits), [:graph:] (visible chars, anything except spaces and control characters), [:lower:] (lowercase letters), [:print:] (all printable chars), [:punct:] (symbols and punctuation), [:space:] (any whitespace), [:upper:] (uppercase letters) and [:xdigit:] (chars in hex values). PCRE adds [:word:] ("word" chars) and [:ascii:] (any ASCII char).

Both support word boundaries, but PCRE patterns do that in a more reliable way. Cf. gsub("\\b", "~", "CODE") yielding ~C~O~D~E~ and gsub("\\b", "~", "CODE", perl=T) producing ~CODE~. Although TRE support specific leading \< and trailing \> word boundaries, PCRE \b are still more reliable.

Both support inline modifiers that change certain pattern behavior when using them inside a pattern, e.g. (?i). TRE supports i (case insensitive), n (dot no longer matches newline), r (causes the regex to be matched in a right associative manner rather than the normal left associative manner. By default, concatenation is left associative in TRE, as per the grammar given in the base specifications on regular expressions of Std 1003.1-2001 (POSIX). This flag flips associativity of concatenation to right associative. Associativity can have an effect on how a match is divided into submatches, but does not change what is matched by the entire regexp) and U (swaps greediness, *? becomes greedy and * becomes lazy). PCRE supports i and U modifiers, and more: m (^ and $ match start/end of the line, not the whole string), s (dot matches newline), x (allows using whitespace to format the pattern and use comments), J (allows using names capturing groups with the same name), X (makes escaping letters with a backslash an error if that combination is not a valid regex token), D (makes $ only match the very end of string, else, it also matches a position before the final trailing newline in the string) and A (only match at the start of string, as if there was \A or ^ in front).

Backtracking aspect

See TRE docs: Matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. That leads to issues with patterns like (.{2,})\1+ to search for duplicate consecutive substrings. See Remove repeated elements in a string with R.

So, when you need to rely on backtracking much, choose PCRE.

What can PCRE do and TRE can't

The most visible shortcoming of TRE is that it does not support lookarounds. However, there are a lot of things that PCRE can boast of:

  • (*SKIP)(*FAIL) PCRE verb combination to match and skip patterns while matching
  • Recursion to match whole patterns that can be nested
  • Subroutines to match nested, balanced substrings to match recursive structures
  • \G anchor that matches start of string or the end of the previous successful match
  • (?|...|...) branch reset group allowing capturing groups inside it to share the same IDs
  • \p{...} and opposite \P{...} Unicode character properties
  • Case-changing operators in the replacement patterns turning the whole or part of the match to lower (\L) or upper case (\U) (up to the \E or end of match if it is missing) (actually, it is an extension of the PCRE library used in R)
  • An infinite-width positive lookbehind alternative, \K match reset operator (\K reference)
  • PCRE supports named capturing groups, but they are not widely used in R. Here is a custom example.

There are more things, like anchors (\A (start of string), \Z (end of string), \z (very end of string)), conditional "if-then-else" construct, atomic groupings (working in the same way as possessive quantifiers, but disallowing backtracking into whole sequences of patterns), etc.

Benchmark tests in Windows 7, Linux Ubuntu 16.04, MacOS Sierra 10.12.6

If we want to compare the performance of the TRE and PCRE regex engines in R, we should use simple patterns that match literally the same texts with these 2 engines.

I use R in Windows mostly, but I installed R 3.2.3 on a Linux VM specifically for this testing. The results for MacOS are borrowed from the t.kalinowski's answer.

Let's compare TRE (default) and PCRE (perl=TRUE) regex performance using microbenchmark library (see more benchmarking options in R):

library(microbenchmark) 

The text is a Wikipedia article about butterflies.

txt <- "Butterflies are insects in the macrolepidopteran clade Rhopalocera from the order Lepidoptera, which also includes moths. Adult butterflies have large, often brightly coloured wings, and conspicuous, fluttering flight. The group comprises the large superfamily Papilionoidea, which contains at least one former group, the skippers (formerly the superfamily \"Hesperioidea\") and the most recent analyses suggest it also contains the moth-butterflies (formerly the superfamily \"Hedyloidea\"). Butterfly fossils date to the Paleocene, which was about 56 million years ago." 

Let's try and extract the last text inside parentheses with sub, a very common sub operation in R:

# sub('.*\\((.*)\\).*', '\\1', txt) # => [1] "formerly the superfamily \"Hedyloidea\"" PCRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', txt, perl=TRUE)) } TRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', txt)) } test <- microbenchmark( PCRE_1(txt), TRE_1(txt), times = 500000 ) test 

The results are the following:

WINDOWS ------- Unit: microseconds         expr     min      lq      mean  median      uq       max neval  PCRE_1(txt) 163.607 165.418 168.65393 166.625 167.229  7314.588 5e+05   TRE_1(txt)  70.031  72.446  74.53842  73.050  74.257 38026.680 5e+05   MacOS  ----- Unit: microseconds          expr    min     lq     mean median     uq       max neval   PCRE_1(txt) 31.693 32.857 37.00757 33.413 35.805 43810.177 5e+05    TRE_1(txt) 46.037 47.199 53.06407 47.807 51.981  7702.869 5e+05  Linux ------ Unit: microseconds         expr    min     lq     mean median     uq       max neval  PCRE_1(txt) 10.557 11.555 13.78216 12.097 12.662  4301.178 5e+05   TRE_1(txt) 25.875 27.350 31.51925 27.805 28.737 17974.716 5e+05 

TRE regex sub wins only in Windows, more than 2 times as fast. On both MacOS and Linux, PCRE (perl=TRUE) version wins with a similar ratio.

Now, let's compare the performance of regexps that don't use backtracking that heavily and extract the words inside double quotes:

# regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt)) # => [1] "\"Hesperioidea\"" "\"Hedyloidea\"" PCRE_2 <- function(text) { return(regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt, perl=TRUE))) } TRE_2 <- function(text) { return(regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt))) } test <- microbenchmark( PCRE_2(txt), TRE_2(txt), times = 500000 ) test  WINDOWS ------- Unit: microseconds         expr     min      lq     mean  median      uq       max neval  PCRE_2(txt) 324.799 330.232 349.0281 332.646 336.269 124404.14 5e+05   TRE_2(txt) 187.755 191.981 204.7663 193.792 196.208  74554.94 5e+05  MacOS ----- Unit: microseconds          expr    min     lq     mean median     uq      max neval   PCRE_2(txt) 63.801 68.115 75.51773 69.164 71.219 47686.40 5e+05    TRE_2(txt) 63.825 67.849 75.20246 68.883 70.933 49691.92 5e+05  LINUX ----- Unit: microseconds         expr    min     lq     mean median     uq     max neval  PCRE_2(txt) 30.199 34.750 44.05169 36.151 43.403 38428.2 5e+05   TRE_2(txt) 37.752 41.854 52.58230 43.409 51.781 38915.7 5e+05 

The best mean value belongs to the PCRE regex in Linux, in MacOS, the difference is almost negligent, and in Windows, TRE works much faster.

Summary

It is clear that TRE (default) regex library works much faster in Windows. In Linux, PCRE regex is considerably faster. In MacOS, PCRE regex is still preferable since, with backtracking patterns, PCRE regex is faster than TRE in that OS.

like image 63
Wiktor Stribiżew Avatar answered Sep 27 '22 19:09

Wiktor Stribiżew