Why implement a different regex engine (e.g. PCRE) as a pragma?

Tags: regex, perl, pcre, re2

I'm curious about best practices for using a different regex engine in place of Perl's default one, and why the modules I've seen are pragmas rather than a more traditional OO/procedural interface.

I've seen a handful of modules for replacing the Perl regex engine with PCRE (re::engine::PCRE), TRE (re::engine::TRE), or RE2 (re::engine::RE2) in a given lexical context. I can't find any object-oriented modules for creating/compiling regular expressions that use a different back end. Why would someone choose to implement this functionality as a pragma rather than as a more typical module? It seems like replacing the Perl regex engine would be a lot harder (depending on the complexity of the API it exposes) than writing an XS module that exposes the API that PCRE, TRE, and RE2 already provide.

Gregory Nisbet, asked Jul 26 '15 01:07


1 Answer

I'm curious about...why the modules I've seen are pragmas and not a more traditional OO/procedural interface.

Probably because the Perl regex API, documented in perldoc perlreapi and available since 5.9.5, lets you take advantage of Perl's parser, which gives you a lot of cool features with little code.

If you use the API, you:

  • don't have to implement your own version of split and the substitution operator s///
  • don't have to write your own code to parse regex modifiers (msixpn are passed as flags to your implementation's callback functions)
  • can take advantage of optimizations like constant regexes being compiled only once (at compile time) and regexes containing interpolated variables being compiled only when the variables change
  • can use qr in your programs to quote regular expressions and easily interpolate them into other regexes
  • can easily set numbered and named capture variables, e.g. $1, $+{foo}
  • don't force users of your engine to rewrite all of their code to use your API; they can simply add a pragma
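As a rough illustration of the list above, here is a minimal sketch of ordinary Perl code that uses qr, split, s///, and capture variables. It runs on the built-in engine; the assumption (not verified here) is that adding a line such as `use re::engine::PCRE;` at the top of the scope would route the same code through the alternative engine, as long as that engine supports the pattern syntax used. The variable names and input string are made up for the example.

```perl
use strict;
use warnings;

# Assumption: with an engine pragma installed from CPAN, a line such as
#   use re::engine::PCRE;
# placed here would be the only change needed; the code below would stay
# as it is, provided the engine understands the syntax used.

my $word = qr/(?<word>[a-z]+)/;     # quoted regex with a named capture
my $pair = qr/${word}=(\d+)/;       # interpolated into a larger regex

my $input = "alpha=1, beta=22 ,gamma=333";

# split is implemented by perl itself, not by the regex engine
my @fields = split /\s*,\s*/, $input;

my %seen;
for my $field (@fields) {
    if ($field =~ $pair) {
        # named capture via %+, numbered capture via $2
        $seen{ $+{word} } = $2;
    }
}

# s/// is likewise handled by perl's parser and runtime
(my $masked = $input) =~ s/\d+/N/g;
```

None of this code refers to an engine-specific API, which is the practical benefit of the pragma approach: an OO wrapper would need its own replacements for split, s///, and the capture variables.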

There are probably more that I've missed. The point is, you get a lot of free code and free functionality with the API. If you look at the implementation of re::engine::PCRE, for example, it's actually fairly short (< 400 lines of XS code).

Alternatives

If you're just looking for an easier way to implement your own regex engine, check out re::engine::Plugin, which lets you write your implementation in Perl instead of C/XS. Do note that there is a long list of caveats, including no support for split and s///.

Alternatively, instead of implementing a completely custom engine, you can extend the built-in engine by using overloaded constants as described in perldoc perlre. This only works in constant regexes; you have to explicitly convert variables before interpolating them into a regex.
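The overloaded-constant approach can be sketched roughly as follows, loosely adapted from the example in perldoc perlre. The package name MyRe and the single-file layout are assumptions made for illustration; in practice the package would live in its own file and be loaded with `use`. The made-up escape `\Y|` is rewritten into lookarounds that match at a boundary between whitespace and non-whitespace.

```perl
use strict;
use warnings;

package MyRe;    # hypothetical package name, for illustration only
use overload ();

# Register a handler for the constant parts of regexes in the caller's scope.
sub import { overload::constant('qr' => \&convert) }

# Receives the source text of a constant regex piece and returns the
# text that the built-in engine should actually compile.
sub convert {
    my ($source) = @_;
    # Expansion for the made-up escape \Y| (whitespace/non-whitespace boundary)
    my $boundary = '(?:(?=\S)(?<!\S)|(?!\S)(?<=\S))';
    $source =~ s/\\Y\|/$boundary/g;
    return $source;
}

package main;

# Equivalent to "use MyRe;" if MyRe lived in its own file.
BEGIN { MyRe->import }

# From here to the end of the scope, constant regexes pass through convert().
my $hit  = "foo bar" =~ /\Y|bar/ ? 1 : 0;   # "bar" starts after whitespace
my $miss = "foobar"  =~ /\Y|bar/ ? 1 : 0;   # "bar" starts mid-word
```

Because the handler only sees constant pieces of a regex, a pattern held in a variable would have to be passed through convert() by hand before being interpolated.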

ThisSuitIsBlackNot, answered Nov 12 '22 21:11
