
Is there a better way to write Perl regexes with /x so the code is still easy to read?

I ran Perl::Critic on one of my scripts, and got this message:

Regular expression without "/x" flag at line 21, column 26. See page 236 of PBP.

I looked up the policy information here, and I understand that writing regular expressions in extended mode will help anyone who is looking at the code.

However, I am stuck on how to convert my code to use the /x flag.

CPAN Example:

# Match a single-quoted string efficiently...

m{'[^\\']*(?:\\.[^\\']*)*'};  #Huh?

# Same thing with extended format...

m{
    '           # an opening single quote
    [^\\']*     # any non-special chars (i.e. not backslash or single quote)
    (?:         # then all of...
        \\ .    #    any explicitly backslashed char
        [^\\']* #    followed by any non-special chars
    )*          # ...repeated zero or more times
    '           # a closing single quote
}x;

This makes sense if you only look at the regex.

My Code:

if ($line =~ /^\s*package\s+(\S+);/ ) {

I am not exactly sure how to use an extended regex inside of an if statement. I can write it like this:

    if (
        $line =~ /
        ^\s*    # starting with zero or more spaces
        package
        \s+     # at least one space
        (\S+)   # capture any non-space characters
        ;       # ending in a semi-colon
        /x
      )
    {

And this works, but I think this is almost harder to read than the original. Is there a better way (or a best practice way) to write this? I guess I could create a variable using qr//.
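For reference, the qr// idea could look roughly like this. It is only a sketch: `$PACKAGE_RE` and `package_name` are names made up for the illustration, and the sample lines are invented.

```perl
use strict;
use warnings;

# Compile the pattern once, with /x, away from the control flow.
my $PACKAGE_RE = qr{
    ^ \s*      # optional leading whitespace
    package
    \s+
    (\S+)      # the package name
    ;
}x;

# Hypothetical helper: returns the package name, or undef if the line
# is not a package declaration.
sub package_name {
    my ($line) = @_;
    if ( $line =~ $PACKAGE_RE ) {
        return $1;
    }
    return;
}

for my $line ( 'package Foo::Bar;', '  package Baz;', 'my $x = 1;' ) {
    my $name = package_name($line);
    print defined $name ? "package: $name\n" : "no match\n";
}
```

The if statement itself then stays on one line, and the commented pattern lives next to a name that says what it matches.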

I'm not really looking for advice on re-writing this specific regex (although if I can improve it, I'll take advice) - I'm more looking for advice on how to expand a regex inside of an if statement.

I know Perl::Critic is just a guideline, but it would be nice to follow it.

Thanks in advance!

EDIT: So after receiving a few answers, it became clear to me that making a regex multi-line with comments is not always necessary. People who understand basic regex should be able to understand what my example was doing - the comments I added were maybe a little unnecessary and verbose. I like the idea of using the extended regex flag, but still embedding spaces in the regex to make each part of the regex a little more clear. Thanks for all the input!
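As a sketch of that middle ground (the sample lines and the `@names` array are made up for the illustration), the pattern can stay on one line inside the if and still use /x purely for spacing:

```perl
use strict;
use warnings;

my @names;    # package names found in the sample input

for my $line ( 'package My::Module;', 'use strict;', '  package Other;' ) {
    # /x ignores the spaces, so each piece of the pattern stands apart.
    if ( $line =~ / ^ \s* package \s+ (\S+) ; /x ) {
        push @names, $1;
    }
}

print "@names\n";
```

One thing to keep in mind: under /x a literal space in the pattern no longer matches a space, so it has to be written as `\s`, `[ ]`, or an escaped space.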

BrianH asked Jun 12 '09 15:06


1 Answer

Never write a comment that says what the code says. Comments should tell you why the code says what it says. Take a look at this monstrosity. Without the comments it is very difficult to see what is going on, but the comments make it clear what each piece is trying to match:

require 5.010;
my $sep         = qr{ [/.-] }x;               #allowed separators    
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century 
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?: 
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 28th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separator
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and 2 digit years divisible by 4 but not 100
            0[48]       | 
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          | 
            [2468][048] |
            [3579][26]
        )
        00                      
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;
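The reason for `\g{-1}` rather than `\1` in the patterns above is that the pieces get embedded in larger patterns, where absolute group numbers shift. A minimal sketch of that effect, with made-up names (`$month_day`, `$year`, `$labelled`, `is_labelled_date`) and made-up sample strings:

```perl
use strict;
use warnings;
require 5.010;    # \g{-1} needs Perl 5.10 or later

my $sep  = qr{ [/.-] }x;      # allowed separators
my $year = qr/ [0-9]{4} /x;   # 4 digit year only, to keep the sketch short

# one or two digits, a separator, one or two digits, the same separator
my $month_day = qr/ [0-9]{1,2} ($sep) [0-9]{1,2} \g{-1} /x;

# Embedding $month_day after another capture group changes the absolute
# number of the ($sep) group, but \g{-1} still refers to the group that
# comes immediately before it.
my $labelled = qr/ \A (\w+) : \s* $month_day $year \z /x;

sub is_labelled_date { return $_[0] =~ $labelled ? 1 : 0 }

for my $s ( 'due: 12/25/2009', 'due: 12-25-2009', 'due: 12/25-2009' ) {
    printf "%-16s %s\n", $s, is_labelled_date($s) ? 'matches' : 'no match';
}
```

The third string mixes separators, so the relative backreference rejects it even though the separator group is no longer group 1.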
Chas. Owens answered Sep 25 '22 23:09