Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How the Perl regular expressions dialect/implementation is called?

The engine for parsing strings which is called "regular expressions" in Perl is very different from what is known by the term "regular expressions" in books.

So, my question is: is there some document describing the Perl's regexp implementation and how and in what ways does it really differ from the classic one (by classic I mean a regular expressions that can really be transformed to ordinary DFA/NFA) and how it works?

Thank you.

like image 868
lithuak Avatar asked Dec 12 '22 12:12

lithuak


1 Answers

Perl regular expressions are of course called Perl regular expressions, or regexes for short. They may also be called patterns or rules. But what they are, or at least can be, is recursive descent parsers. They’re implemented using a recursive backtracker, although you can swap in a DFA engine if you prefer to offload DFA‐solvable tasks to it.

Here are some relevant citations about these matters, with all emboldening — and some of the text :) — mine:

You specify a pattern by creating a regular expression (or regex), and Perl’s regular expression engine (the “Engine”, for the rest of this chapter) then takes that expression and determines whether (and how) the pattern matches your data. While most of your data will probably be text strings, there’s nothing stopping you from using regexes to search and replace any byte sequence, even what you’d normally think of as “binary” data. To Perl, bytes are just characters that happen to have an ordinal value less than 256.

If you’re acquainted with regular expressions from some other venue, we should warn you that regular expressions are a bit different in Perl. First, they aren’t entirely “regular” in the theoretical sense of the word, which means they can do much more than the traditional regular expressions taught in computer science classes. Second, they are used so often in Perl that they have their own special variables, operators, and quoting conventions which are tightly integrated into the language, not just loosely bolted on like any other library.

      — Programming Perl, by Larry Wall, Tom Christiansen, and Jon Orwant

This is the Apocalypse on Pattern Matching, generally having to do with what we call “regular expressions”, which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I’m not going to try to fight linguistic necessity here. I will, however, generally call them “regexes” (or “regexen”, when I’m in an Anglo‐Saxon mood).

      — Perl6 Apocalypse 5: Pattern Matching, by Larry Wall

There’s a lot of new syntax there, so let’s step through it slowly, starting with:

    $file = rx/ ^  <$hunk>*  $ /;

This statement creates a pattern object. Or, as it’s known in Perl 6, a “rule”. People will probably still call them “regular expressions” or “regexes” too (and the keyword rx reflects that), but Perl patterns long ago ceased being anything like “regular”, so we’ll try and avoid those terms.

[Update: We’ve resurrected the term “regex” to refer to these patterns in general. When we say “rule” now, we’re specifically referring to the kind of regex that you would use in a grammar. See S05.]

      — Perl6 Exegesis 5: Pattern Matching, by Damian Conway

This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them regex rather than “regular expressions” because they haven’t been regular expressions for a long time, and we think the popular term “regex” is in the process of becoming a technical term with a precise meaning of: “something you do pattern matching with, kinda like a regular expression”. On the other hand, one of the purposes of the redesign is to make portions of our patterns more amenable to analysis under traditional regular expression and parser semantics, and that involves making careful distinctions between which parts of our patterns and grammars are to be treated as declarative, and which parts as procedural.

In any case, when referring to recursive patterns within a grammar, the terms rule and token are generally preferred over regex.

      — Perl6 Synopsis 5: Regexes and Rules, by Damian Conway, Allison Randal, Patrick Michaud, Larry Wall, and Moritz Lenz

like image 126
tchrist Avatar answered Jan 04 '23 23:01

tchrist