Which regular expression engine type does R use as a standard?

Tags: regex, r

Jeffrey Friedl lists 3 main types of regex engines in his book "Mastering Regular Expressions":

  • Traditional NFA
  • POSIX NFA
  • DFA (POSIX or not)

Which of these does R use as a standard?

asked Jun 24 '15 19:06 by histelheim



2 Answers

The ?regex page cites the TRE documentation. Near the top of the grep.c source we see:

/* As from TRE 0.8.0, tre.h replaces regex.h */
#include <tre/tre.h>

Repeating my earlier comment: http://swtch.com/~rsc/regexp says TRE uses an NFA. PCRE is used instead when perl=TRUE.
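A common way to probe which family an engine belongs to is an alternation test: match the pattern nfa|nfa not against the string "nfa not". A traditional (backtracking) NFA takes the first alternative that succeeds, while a POSIX NFA or a DFA reports the leftmost-longest match. The sketch below is in Python, not R, and only illustrates the two behaviours. The helper names are hypothetical, and the leftmost-longest function is a crude emulation that assumes simple literal alternatives. It is not how TRE or PCRE are implemented.

```python
import re

def first_alternative_match(alternatives, text):
    """Ordered alternation as a backtracking engine does it.

    Python's re is a backtracking engine, so at the leftmost position
    the first alternative that succeeds wins.
    """
    m = re.search("|".join(alternatives), text)
    return m.group(0) if m else None

def leftmost_longest_match(alternatives, text):
    """Crude emulation of POSIX leftmost-longest semantics.

    Assumption: each alternative is a simple literal-like pattern, so
    searching each one separately and keeping the earliest, then longest,
    result is good enough for illustration.
    """
    best = None
    for alt in alternatives:
        m = re.search(alt, text)
        if m is None:
            continue
        # Smaller start wins; for equal starts, the longer match wins.
        key = (m.start(), -len(m.group(0)))
        if best is None or key < best[0]:
            best = (key, m.group(0))
    return best[1] if best else None

alts = ["nfa", "nfa not"]
print(first_alternative_match(alts, "nfa not"))
print(leftmost_longest_match(alts, "nfa not"))
```

In R, the same probe could be run by calling regexpr() on that pattern and string once with the default engine and once with perl=TRUE, then comparing the match lengths.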

answered Oct 04 '22 03:10 by IRTFM


My understanding (though I have not found this in official documents) is that the R regex functions by default use the Tcl regex library, which is a hybrid of DFA and NFA.

The engine first scans the regex for any pieces that are not DFA-compatible and extracts the parts that are (so it strips out back references and other features that are only available in an NFA). It then tries to match this (possibly) simplified pattern using a DFA engine. If that finds no match, the full regex cannot match either, and the engine returns a failure. If it does find a match, the engine goes back and matches the full regex using an NFA engine (I think traditional/non-POSIX), starting at the location where the simplified match occurred. This is much faster (for both non-matches and matches) than a straight NFA engine, but still lets you use all the NFA features that a DFA does not support.
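The two-phase control flow described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual library code: the function name two_phase_search is hypothetical, the relaxed pattern is supplied by hand rather than derived automatically, and both phases use Python's backtracking re module, so the sketch shows only the logic and none of the speed benefit of a real DFA pre-pass.

```python
import re

def two_phase_search(full_pattern, relaxed_pattern, text):
    """Pre-filter with a relaxed pattern, then run the full pattern.

    Assumption: relaxed_pattern matches a superset of what full_pattern
    matches (for example, a back reference replaced by the sub-pattern
    it refers to). Under that assumption the leftmost relaxed match
    cannot start later than the leftmost full match.
    """
    relaxed = re.compile(relaxed_pattern)
    full = re.compile(full_pattern)

    # Phase 1: cheap scan. No relaxed match means no full match.
    candidate = relaxed.search(text)
    if candidate is None:
        return None

    # Phase 2: run the full pattern, starting where the relaxed
    # pattern first matched instead of at the beginning of the text.
    return full.search(text, candidate.start())

# Full pattern: a word repeated via a back reference.
# Relaxed pattern: any two words separated by a space.
m = two_phase_search(r"(\w+) \1", r"\w+ \w+", "say hello hello world")
print(m.group(0) if m else None)
```

The relaxed pattern here drops the back reference, which is the kind of feature the answer says a DFA cannot handle. The second phase can still fail even when the first succeeds.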

If you specify perl=TRUE in any function, it switches to the PCRE library, which is most like a traditional NFA (though I understand that it is not F, A, or traditional).

answered Oct 04 '22 02:10 by Greg Snow