Looking at some of the regex questions commonly asked on SO, it seems to me there are a number of areas where traditional regex syntax falls short of the tasks people want it to do nowadays. For instance:
Matching a number between 1 and 31:
The usual answer is: don't use regex for this, use normal conditional comparisons. That's fine if you've got just the number by itself, but not so great when you want to match the number as part of a longer string. Why can't we write something like \d{1~31}, and either have the regex do some form of counting or have the regex engine internally translate it into [1-9]|[12]\d|3[01]?
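As a rough sketch of the "translate it internally" idea, here is what such an expansion could look like in Python. The helper name range_pattern is made up for illustration; it is not part of any regex library, and a real engine would presumably produce something more compact than a plain alternation.

```python
import re

def range_pattern(lo, hi):
    """Hypothetical helper: expand an integer range into a regex alternation."""
    # Put longer alternatives first so "31" is tried before "3".
    alternatives = sorted((str(n) for n in range(lo, hi + 1)), key=len, reverse=True)
    # The lookarounds keep the match from being part of a longer run of digits.
    return r"(?<!\d)(?:" + "|".join(alternatives) + r")(?!\d)"

day = re.compile(range_pattern(1, 31))
print(day.findall("due on day 7, day 31, day 32 and day 007"))  # ['7', '31']
```

The lookarounds are a design choice: without them, "32" would match as "3", which is usually not what people asking this question want.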
Matching an even or odd number of occurrences of a string:
This results in a very messy regex; it would be great to be able to just write (mytext){Odd}.
Parsing XML:
We all know that's a bad idea, but this and similar tasks would be easier if the [^ ] operator weren't limited to just a single character. It'd be nice to be able to write <name>(.*)[^(</name>)].
Validating an email address:
Very commonly done and yet very complex to do correctly with regex. It'd save everyone having to reinvent the wheel if a syntax like {IsEmail} could be used instead.
I'm sure there are others that would be useful too. I don't know enough about regex internals to know how easy these would be to implement, or if it would even be possible. Implementing some form of counting (to solve the first two problems) may mean it's not technically a 'regular expression' anymore, but it sure would be useful.
Is a 'regex 2.0' syntax desirable and technically possible, and is anyone working on anything like this?
I believe Larry Wall covered this with Perl 6 regexes. The basic idea is to replace simple regular expressions with more useful grammar rules. They're easier to read, and it's easier to put code in for things like making sure that you have a particular number of matches. Plus, you can name rules like IsEmail. I can't possibly list all the details here, but suffice it to say, it sounds like what you're suggesting.
Here are some examples from http://dev.perl.org/perl6/doc/design/exe/E05.html:
Matching IP address:
token quad { (\d**1..3) <?{ $0 < 256 }> }
$str ~~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
Matching nested parentheses:
$str ~~ m/ \( [ <-[()]> + : | <self> ]* \) /;
Annotated:
$str ~~ m/ <'('> # Match a literal '('
[ # Start a non-capturing group
<-[()]> + # Match a non-paren (repeatedly)
: # ...and never backtrack that match
| # Or
<self> # Recursively match entire pattern
]* # Close group and match repeatedly
<')'> # Match a literal ')'
/;
Don't blame the tool, blame the user.
Regular Expressions were built for matching patterns in strings. That's it.
They were not made for:
Should we redesign the screwdriver because it can't nail? NO, use a hammer.
Simply use the proper tool for the task. Stop using regular expressions for tasks which they don't qualify for.
I want to match a number between 1 and 31, how do I do that?
Use your language constructs to try to convert the string to an integer and do the appropriate comparisons.
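A minimal sketch of that approach in Python, assuming the number sits inside a longer string: let the regex find the digits, and let ordinary code do the range check. The function name find_days is just for illustration.

```python
import re

def find_days(text):
    """Return every integer in text that falls between 1 and 31."""
    days = []
    for match in re.finditer(r"\d+", text):
        value = int(match.group())  # int() accepts leading zeros, so "031" becomes 31
        if 1 <= value <= 31:
            days.append(value)
    return days

print(find_days("rooms 12, 45 and 031"))  # [12, 31]
```

Whether "031" should count as 31 depends on the input format; if it shouldn't, check the matched text as well as the value.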
How do I match an even/odd number of occurrences of a specific string?
Regular expressions are not a string parser. You can however extract the relevant part with a regular expression if you only need to parse a sub-section of the original string.
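For example, a small Python sketch (names are illustrative only): use the regex to find the occurrences, then count them in normal code rather than encoding the parity in the pattern.

```python
import re

def has_odd_occurrences(text, needle):
    """True if needle occurs an odd number of times (non-overlapping) in text."""
    # re.escape makes sure the needle is treated as a literal string.
    count = len(re.findall(re.escape(needle), text))
    return count % 2 == 1

print(has_odd_occurrences("abcabcabc", "abc"))  # True
print(has_odd_occurrences("abab", "ab"))        # False
```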
How do I parse XML with regex?
You don't. Use an XML or an HTML parser depending on your needs. Also, an XML parser can't do the job of an HTML parser (unless you have a perfectly formed XHTML document) and the reverse is also true.
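To illustrate, here is a short sketch using Python's standard xml.etree.ElementTree module on a made-up document. It assumes well-formed XML; real-world HTML would need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, for illustration only.
doc = "<people><name>Ada</name><name>Grace</name></people>"

root = ET.fromstring(doc)
# iter("name") walks every <name> element, however deeply nested.
names = [element.text for element in root.iter("name")]
print(names)  # ['Ada', 'Grace']
```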
How do I validate an email with regex?
You either use this large abomination or you do it properly with a parser.
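As one possible reading of "with a parser", Python's standard library has email.utils.parseaddr, which splits an address header value into a display name and an address. This is only a sketch: parseaddr is lenient, so a non-empty result does not prove the address is valid or deliverable. The wrapper name split_address is hypothetical.

```python
from email.utils import parseaddr

def split_address(raw):
    """Split 'Display Name <user@host>' into (display name, address)."""
    return parseaddr(raw)

print(split_address("Ada Lovelace <ada@example.com>"))
# ('Ada Lovelace', 'ada@example.com')
```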
All of those are reasonably possible in Perl.
To match a 1..31 with a regex pattern:
/( [0-9]+ ) (?(?{ $^N < 1 || $^N > 31 })(*FAIL)) /x
To generate something like [1-9]|[12]\d|3[01]:
use Regexp::Assemble qw( );
my $ra = Regexp::Assemble->new();
$ra->add($_) for (1..31);
my $re = $ra->re; # qr/(?:[456789]|3[01]?|1\d?|2\d?)/
Perl 5.10+ uses tries to optimise alternations, so the following should be sufficient:
my $re = join '|', 1..31;
$re = qr/$re/;
To match an even number of occurrences:
/ (?: (?:pat){2} )* /x
To match an odd number of occurrences:
/ pat (?: (?:pat){2} )* /x
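The same trick carries over to other engines. Here is a sketch in Python, with the literal "ab" standing in for pat. Note the explicit group around the repeated string: a bare ab{2} would repeat only the final character.

```python
import re

# "ab" stands in for the pattern being counted.
EVEN = re.compile(r"(?:(?:ab){2})*")    # zero or more pairs
ODD = re.compile(r"ab(?:(?:ab){2})*")   # one occurrence, then pairs

# fullmatch anchors the pattern to the whole string.
print(bool(EVEN.fullmatch("abab")))    # True
print(bool(EVEN.fullmatch("ababab")))  # False
print(bool(ODD.fullmatch("ababab")))   # True
```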
Pattern negative match:
m{<name> (.*?) </name>}x # Non-greedy matching
m{<name> ( (?: (?!</name>). )* ) </name>}x # Each character must not start </name>
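For comparison, a Python sketch of the second form, run on a made-up input. The negative lookahead is checked before every character, so the capture stops at the first closing tag.

```python
import re

# (?!</name>). means: any character, as long as "</name>" does not start here.
NAME = re.compile(r"<name>((?:(?!</name>).)*)</name>")

sample = "<name>Ada</name><name>Grace</name>"
print(NAME.findall(sample))  # ['Ada', 'Grace']
```

By default "." does not match a newline in Python, so content spanning several lines would need the re.DOTALL flag.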
To get a pattern matching email addresses:
use Regexp::Common qw( Email::Address );
/$RE{Email}{Address}/
It probably already exists, and has for a long time. It's called "grammars". Ever heard of yacc and lex? Still, there is a need for something simple. Strange as it may appear, the big strength of regexes is that they are very simple to write on the spot.
I believe that in some large but specialized areas, what is needed already exists. I'm thinking of XPath syntax.
Is there a broader alternative (not limited to XML, but still simple) that could cover all cases? Maybe you should take a look at Perl 6 grammars.