
Why are regular expressions so controversial? [closed]

Tags:

regex

I don't think people object to regular expressions because they're slow, but rather because they're hard to read and write, as well as tricky to get right. While there are some situations where regular expressions provide an effective, compact solution to the problem, they are sometimes shoehorned into situations where it's better to use an easy-to-read, maintainable section of code instead.


Making Regexes Maintainable

A major advance toward demystifying the patterns previously referred to as “regular expressions” is Perl’s /x regex flag — sometimes written (?x) when embedded — which allows whitespace (line breaks, indentation) and comments. This seriously improves readability, and therefore maintainability. The whitespace allows for cognitive chunking, so you can see what groups with what.
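
For instance, compare a one-line date pattern with its /x form (a made-up example, purely to illustrate the flag):

$date = qr{
    \A
    ( \d{4} )    # year
    -
    ( \d{2} )    # month
    -
    ( \d{2} )    # day
    \z
}x;

Squeezed back onto one line as qr{\A(\d{4})-(\d{2})-(\d{2})\z}, the same pattern hides all of that structure.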

Modern patterns also support both relatively numbered and named backreferences. That means you no longer have to count capture groups to figure out whether you need $4 or \7, which helps greatly when creating patterns meant to be included in larger patterns.

Here is an example using a relatively numbered backreference:

$dupword = qr{ \b (?: ( \w+ ) (?: \s+ \g{-1} )+ ) \b }xi;
$quoted  = qr{ ( ["'] ) $dupword  \1 }x;

And here is an example of the superior approach, named captures:

$dupword = qr{ \b (?: (?<word> \w+ ) (?: \s+ \k<word> )+ ) \b }xi;
$quoted  = qr{ (?<quote> ["'] ) $dupword  \g{quote} }x;
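
A quick way to convince yourself the named-capture version works (the test string is my own illustration):

use 5.010;
if ( q{She said "no no" to me} =~ $quoted ) {
    say $+{quote};    # prints: "
    say $+{word};     # prints: no
}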

Grammatical Regexes

Best of all, these named captures can be placed within a (?(DEFINE)...) block, so that you can separate out the declaration from the execution of individual named elements of your patterns. This makes them act rather like subroutines within the pattern.
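
As a minimal sketch of the idea (a toy floating-point pattern of my own, not from the answers linked below):

use 5.010;
my $number = qr{
    (?(DEFINE)                   # declarations only; nothing in here matches yet
        (?<sign>    [-+]                             )
        (?<digits>  \d+                              )
        (?<decimal> (?&digits) (?: \. (?&digits) )?  )
    )
    \A (?&sign)? (?&decimal) \z  # execution: invoke the named pieces
}x;

say "ok" if '-3.14' =~ $number;
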
A good example of this sort of “grammatical regex” can be found in this answer and this one. These look much more like a grammatical declaration.

As the latter reminds you:

… make sure never to write line‐noise patterns. You don’t have to, and you shouldn’t. No programming language that forbids white space, comments, subroutines, or alphanumeric identifiers can be maintainable. So use all those things in your patterns.

This cannot be over-emphasized. Of course, if you don’t use those things in your patterns, you will often create a nightmare. But if you do use them, you need not.

Here’s another example of a modern grammatical pattern, this one for parsing RFC 5322:

use 5.10.0;

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+/=?^_`{|}~-])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;
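
Using it is then a one-liner (my own test address; note the pattern as written is unanchored, so anchor it if the whole string must be an address):

if ( q{"Fred Bloggs" <fred@example.com>} =~ /\A$rfc5322\z/ ) {
    say "parses as an RFC 5322 address";
}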

Isn't that remarkable — and splendid? You can take a BNF-style grammar and translate it directly into code without losing its fundamental structure!

If modern grammatical patterns still aren’t enough for you, then Damian Conway’s brilliant Regexp::Grammars module offers an even cleaner syntax, with superior debugging, too. Here’s the same code for parsing RFC 5322 recast into a pattern from that module:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> \@ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
                             \] <CFWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+/=?^_`{|}~-]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*

    <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
    <token: quoted_pair>     \\ <.text>

    <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]

    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            \x0d \x0a
    <token: DQUOTE>          "
    <token: WSP>             [\x20\x09]

    }x;

};


while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

There’s a lot of good stuff in the perlre manpage, but these dramatic improvements in fundamental regex design features are by no means limited to Perl alone. Indeed the pcrepattern manpage may be an easier read, and covers the same territory.

Modern patterns have almost nothing in common with the primitive things you were taught in your finite automata class.


Regexes are a great tool, but people think "Hey, what a great tool, I'll use it to do X!" where X is something a different tool (usually a parser) is better suited for. It's the classic problem of using a hammer where you need a screwdriver.


Almost everyone I know who uses regular expressions regularly (pun intended) comes from a Unix-ish background where they use tools that treat REs as first-class programming constructs, such as grep, sed, awk, and Perl. Since there's almost no syntactic overhead to use a regular expression, their productivity goes way up when they do.

In contrast, programmers who use languages in which REs are an external library tend not to consider what regular expressions can bring to the table. The programmer "time-cost" is so high that either a) REs never appeared as part of their training, or b) they don't "think" in terms of REs and prefer to fall back on more familiar patterns.


Regular expressions allow you to write a custom finite-state machine (FSM) in a compact way, to process a string of input. There are at least two reasons why using regular expressions is hard:

  • Old-school software development involves a lot of planning, paper models, and careful thought. Regular expressions fit this model very well, because writing an effective expression properly involves a lot of staring at it, visualizing the paths of the FSM.

    Modern software developers would much rather hammer out code, and use a debugger to step through execution, to see if the code is correct. Regular expressions do not support this working style very well. One "run" of a regular expression is effectively an atomic operation. It's hard to observe stepwise execution in a debugger.

  • It's too easy to write a regular expression that accidentally accepts more input than you intend. The value of a regular expression isn't really that it matches valid input; it's that it fails to match invalid input. Techniques for writing "negative tests" for regular expressions are not very advanced, or at least not widely used; a sketch of one such approach follows this list.

    This goes to the point of regular expressions being hard to read. Just by looking at a regular expression, it takes a lot of concentration to visualize all possible inputs that should be rejected, but are mistakenly accepted. Ever try to debug someone else's regular expression code?
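
To make the second point concrete, one low-tech but effective discipline is to keep strings that must match alongside strings that must not, and test both sets. A sketch using Perl's Test::More (the ZIP-code pattern is my own hypothetical example):

use strict;
use warnings;
use Test::More;

my $zip = qr{ \A \d{5} (?: - \d{4} )? \z }x;    # hypothetical US ZIP pattern

# Valid input must match...
like('90210',      $zip, 'plain five digits');
like('90210-1234', $zip, 'ZIP+4 form');

# ...and invalid input must NOT match.
unlike('9021',      $zip, 'too short');
unlike('90210-12',  $zip, 'truncated extension');
unlike("90210 X",   $zip, 'trailing garbage');

done_testing();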

If there's a resistance to using regular expressions among software developers today, I think it's chiefly due to these two factors.


People tend to think regular expressions are hard, but that's because they're using them wrong: writing complex one-liners without any comments, indentation, or named captures. (You don't cram your complex SQL query onto one line without comments, indentation, or aliases, do you?) So yes, written that way, they don't make sense to a lot of people.

However, if your job has anything to do with parsing text (roughly any web application out there...) and you don't know regular expressions, you suck at your job and you're wasting your own time and your employer's. There are excellent resources out there to teach you everything about them that you'll ever need to know, and more.


Because they lack the most popular learning tools found in commonly accepted IDEs: there's no regex wizard, not even autocompletion. You have to code the whole thing all by yourself.