Equivalent of Perl /x (ignore whitespace) mode in R regular expressions

Tags: regex, r, pcre

Perl has a lovely modifier, /x, that ignores whitespace in regular expressions. That is not to say that the pattern matches regardless of whitespace in the subject; rather, whitespace in the pattern itself is dropped when the regex is interpreted, unless it is escaped.

I.e. ^x[0-7][x-z][ABCpuq*]*$ could be written equivalently but much more readably as ^x [0-7] [x-z] [ABCpuq*]*$ in /x mode.

grep and its ilk in R seem to have no such mode, but given their Perl compatibility, is there an option to pass? I've tried a few options but no such luck.

> grepl( "^x[0-7][x-z][ABCpuq*]*$", "x5yuuA" )
[1] TRUE
> grepl( "^x [0-7] [x-z][ABCpuq*]*$", "x5yuuA" )
[1] FALSE
> grepl( "^x [0-7] [x-z][ABCpuq*]*$", "x5yuuA", perl=TRUE )
[1] FALSE
> grepl( "^x [0-7] [x-z][ABCpuq*]*$/x", "x5yuuA", perl=TRUE )
[1] FALSE

Secondary question: How directly do R's Perl-style regexes rely on the C PCRE library? There seems to be a PCRE_EXTENDED option bit that turns on ignoring whitespace.

Ari B. Friedman asked Jun 14 '14 20:06


1 Answer

Free-Spacing Mode

In R, to use free-spacing mode for an entire expression, pop the (?x) mode modifier at the beginning of your regex in PCRE mode (perl=TRUE).

Example:

subject <- "3b"  # sample input, assumed here for illustration
grepl("(?x)  # free spacing\n\\d    # a digit\n[bc]  # b or c", subject, perl = TRUE)

The (?x) modifier works in most regex flavors. Some exceptions: JavaScript, MySQL, Oracle, VBScript, XPath.
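
Applied to the pattern from the question, this might look like the sketch below. The variable names are mine, chosen for illustration. It also shows that a literal space has to be escaped once free-spacing mode is on.

```r
# Pattern from the question, spaced out with the (?x) inline modifier
spaced  <- "(?x) ^x [0-7] [x-z] [ABCpuq*]* $"
compact <- "^x[0-7][x-z][ABCpuq*]*$"

res_spaced  <- grepl(spaced,  "x5yuuA", perl = TRUE)
res_compact <- grepl(compact, "x5yuuA", perl = TRUE)

# An unescaped space in the pattern is dropped, so this is effectively "ab"
res_unescaped <- grepl("(?x) a b", "a b", perl = TRUE)

# An escaped space ("\\ " in an R string) has to match a real space
res_escaped <- grepl("(?x) a \\  b", "a b", perl = TRUE)

print(c(res_spaced, res_compact, res_unescaped, res_escaped))
```

Both forms of the question's pattern should agree on "x5yuuA", while only the escaped-space version should match "a b".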

Perl mode and PCRE

How closely does Perl mode rely on PCRE? Entirely. (That's a good thing. See below.)

From the R manual:

The perl = TRUE argument to grep, regexpr, gregexpr, sub, gsub and strsplit switches to the PCRE library that implements regular expression pattern matching using the same syntax and semantics as Perl 5.10, with just a few differences.
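
If you want to see which PCRE build your R is linked against, something like the following should work in reasonably recent R versions. It relies on extSoftVersion() from base R; the variable name is just for illustration.

```r
# Version string of the PCRE library this R build is linked against
pcre_version <- extSoftVersion()[["PCRE"]]
print(pcre_version)
```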

Some Refinements

  • you can turn on (?x) at any point in the regex
  • you can turn it off with (?-x)
  • you can turn it on for just one set of parentheses, as in (?x: \w \d)
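
A rough sketch of those three refinements, using made-up patterns and subjects (none of these come from the question):

```r
# (?x) switched on part-way through: the space in "a b" is literal,
# the spaces after (?x) are ignored, so this is effectively "^a bcd$"
mid_on <- grepl("^a b(?x) c d $", c("a bcd", "abcd"), perl = TRUE)

# (?-x) switches free-spacing off again: the space before "c" is literal
mid_off <- grepl("(?x) ^ a b (?-x) c$", c("ab c", "abc"), perl = TRUE)

# (?x: ... ) scopes free-spacing to one group; " z" outside it is literal
scoped <- grepl("^(?x: \\w \\d ) z$", c("a1 z", "a1z"), perl = TRUE)

print(rbind(mid_on, mid_off, scoped))
```

In each pair, the first subject contains the space that stays literal and should match; the second omits it and should not.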

In Praise of PCRE

Having access to PCRE is a good thing.

PCRE is one of the contenders for the title of very best Perl-style engine, along with .NET, Matthew Barnett's regex module for Python, and Perl itself. It is widely used in high-visibility environments (Apache, PHP, Notepad++) so it gets a lot of attention. Among other treats, it gives you access to exotic features such as:

  • Recursion and subroutine calls
  • \K to "Keep Out" what has been matched so far from the returned match
  • Backtracking control: (*SKIP)(*F) and others
  • Branch reset (allowing you to set capture Group #1 at various places)
  • (?(DEFINE)..., which can help you refactor a complex regex
  • Conditionals.
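
To give a flavor of these from R, here is a small sketch with perl=TRUE. The patterns and sample strings are my own illustrations, not taken from the question, and each one is deliberately minimal.

```r
# \K: "foo" must match but is kept out of the reported match,
# so only "bar" gets replaced
keep_out <- sub("foo\\Kbar", "X", "foobar", perl = TRUE)

# Recursion: (?1) re-enters group 1 to match nested parentheses
nested <- grepl("^(\\((?:[^()]|(?1))*\\))$", c("(a(b)c)", "(a(b)c"), perl = TRUE)

# (*SKIP)(*F): match quoted strings and discard them,
# then replace the digit runs that remain
skipped <- gsub("\"[^\"]*\"(*SKIP)(*F)|\\d+", "N", "a 12 \"34\" 56", perl = TRUE)

# Branch reset: both alternatives capture into group 1
reset <- sub("^(?|a(\\d)|b(\\d))$", "\\1", c("a3", "b7"), perl = TRUE)

# Conditional: require ")" only if "(" was captured by group 1
cond <- grepl("^(\\()?\\d+(?(1)\\))$", c("(12)", "12", "(12"), perl = TRUE)

# (?(DEFINE)...): declare a named subpattern once, call it with (?&byte)
ipv4 <- "(?x)
  (?(DEFINE) (?<byte> 25[0-5] | 2[0-4]\\d | 1?\\d?\\d ) )
  ^ (?&byte) (?: \\. (?&byte) ){3} $"
defined <- grepl(ipv4, c("192.168.0.1", "256.1.1.1"), perl = TRUE)

print(list(keep_out, nested, skipped, reset, cond, defined))
```

The DEFINE example also leans on free-spacing mode, which is what lets the pattern be spread over several indented lines.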

What's missing in PCRE?

  • Infinite-width lookbehinds (as in .NET) would be a terrific addition.
  • So would .NET's really fun balancing groups. That will probably never happen because balancing groups are often seen as recursion's poor brother... However, they allow you to do other things, such as easily setting up counters.
  • Character class subtraction.
  • Some may miss fuzzy matching from Barnett's regex module (can't comment as I haven't used that feature).
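
On the lookbehind point, a common workaround is to lean on \K instead. The sketch below uses a made-up pattern and subject. The first call is wrapped in tryCatch on the assumption that a lookbehind with an unbounded quantifier fails to compile; the exact behavior depends on the PCRE version R is linked against.

```r
# Lookbehind with an unbounded quantifier. Assumption: this is rejected at
# compile time, so the error message is captured rather than stopping the script
lookbehind <- tryCatch(
  suppressWarnings(sub("(?<=a+)b", "X", "aaab", perl = TRUE)),
  error = function(e) conditionMessage(e)
)

# \K-based rewrite: match the run of a's, then keep it out of the reported match
with_k <- sub("a+\\Kb", "X", "aaab", perl = TRUE)

print(lookbehind)
print(with_k)
```

This is not a full replacement: \K consumes the text before it, so it behaves differently from a true lookbehind when matches overlap.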
zx81 answered Sep 30 '22 19:09
