
A successor to regex? [closed]

Tags:

regex

Looking at some of the regex questions commonly asked on SO, it seems to me there are a number of areas where the traditional regex syntax falls short of the tasks people now want it to do. For instance:

  • I want to match a number between 1 and 31, how do I do that?

The usual answer is: don't use regex for this, use normal conditional comparisons. That's fine if you've got just the number by itself, but not so great when you want to match the number as part of a longer string. Why can't we write something like \d{1~31}, and either modify the regex to do some form of counting or have the regex engine internally translate it into [1-9]|[12]\d|3[01]?

  • How do I match an even/odd number of occurrences of a specific string?

This results in a very messy regex; it would be great to be able to just do (mytext){Odd}.

  • How do I parse XML with regex?

We all know that's a bad idea, but this and similar tasks would be easier if the [^ ] operator wasn't limited to just a single character. It'd be nice to be able to do <name>(.*)[^(</name>)]

  • How do I validate an email with regex?

Very commonly done and yet very complex to do correctly with regex. It'd save everyone having to re-invent the wheel if a syntax like {IsEmail} could be used instead.


I'm sure there are others that would be useful too. I don't know enough about regex internals to know how easy these would be to implement, or whether they would even be possible. Implementing some form of counting (to solve the first two problems) may mean it's not technically a 'regular expression' anymore, but it sure would be useful.

Is a 'regex 2.0' syntax desirable and technically possible, and is anyone working on anything like this?

Michael Low asked Jan 27 '11 07:01


4 Answers

I believe Larry Wall covered this with Perl 6 regexes. The basic idea is to replace simple regular expressions with more useful grammar rules. They're easier to read, and it's easier to embed code for things like making sure that you have an even or odd number of matches. Plus, you can name rules like IsEmail. I can't possibly list all the details here, but suffice it to say, it sounds like what you're suggesting.

Here are some examples from http://dev.perl.org/perl6/doc/design/exe/E05.html:

Matching IP address:

token quad {  (\d**1..3) <?{ $1 < 256 }>  }
$str ~~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
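
The same idea, matching digits and then checking their numeric value in code, can be approximated outside Perl 6. Below is a minimal Python sketch, not a translation of the cited document: the names `QUAD_LIST` and `is_ipv4` are illustrative, and because Python's `re` module has no in-pattern code assertions, the numeric check runs after the match.

```python
import re

# Four groups of 1-3 ASCII digits separated by literal dots.
# The "< 256" condition is checked in code, not in the pattern.
QUAD_LIST = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")

def is_ipv4(text: str) -> bool:
    """Return True if text is four dot-separated numbers, each below 256."""
    m = QUAD_LIST.fullmatch(text)
    return m is not None and all(int(q) < 256 for q in m.groups())
```

Keeping the range check in ordinary code keeps the pattern short, at the cost of splitting the rule across two places.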

Matching nested parentheses:

$str =~ m/ \(  [ <-[()]> + : | <self> ]*  \) /;

Annotated:

    $str =~ m/ <'('>                # Match a literal '('
               [                    # Start a non-capturing group
                    <-[()]> +       #    Match a non-paren (repeatedly)
                    :               #    ...and never backtrack that match
               |                    # Or
                    <self>          #    Recursively match entire pattern
               ]*                   # Close group and match repeatedly
               <')'>                # Match a literal ')'
             /;
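
To show what the recursive rule is doing, here is a hand-written equivalent as a minimal Python sketch. Python's built-in `re` module has no recursive-pattern construct, so this uses a recursive function instead; the name `match_parens` and its return convention are assumptions made for this illustration.

```python
def match_parens(text: str, pos: int = 0) -> int:
    """Return the index just past a balanced '(...)' group starting at pos, or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1                          # must start with a literal '('
    pos += 1
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            return pos + 1                 # found the matching close
        if ch == "(":
            pos = match_parens(text, pos)  # nested group: recurse, like <self>
            if pos == -1:
                return -1
        else:
            pos += 1                       # ordinary non-paren character
    return -1                              # input ended before the close
```

For example, `match_parens("(a(b)c)")` walks the whole string, while an unbalanced input such as `"(a(b)c"` yields -1.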
Gabe answered Oct 01 '22 11:10


Don't blame the tool, blame the user.

Regular expressions were built for matching patterns in strings. That's it.

They were not made for:

  • Integer validation
  • Markup language parsing
  • Very complex validation (e.g. RFC 2822)
  • Exact string comparison
  • Spelling correction
  • Vector computation
  • Genetic decoding
  • Miracle making
  • Baby saving
  • Finance administering
  • Sub-atomic partitioning
  • Flux capacitor activating
  • Warp core engaging
  • Time traveling
  • Headache inducing
    Never mind that last one. It seems that regular expressions are very well suited to that last task when they are used where they shouldn't be.

Should we redesign the screwdriver because it can't drive nails? No, use a hammer.

Simply use the proper tool for the task. Stop using regular expressions for tasks they aren't suited to.

  • I want to match a number between 1 and 31, how do I do that?
    Use your language constructs to try to convert the string to an integer and do the appropriate comparisons.

  • How do I match an even/odd number of occurrences of a specific string?
    Regular expressions are not a string parser. You can however extract the relevant part with a regular expression if you only need to parse a sub-section of the original string.

  • How do I parse XML with regex?
    You don't. Use an XML or an HTML parser, depending on your need. Also, an XML parser can't do the job of an HTML parser (unless you have a perfectly formed XHTML document), and the reverse is also true.

  • How do I validate an email with regex?
    You either use this large abomination or you do it properly with a parser.
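
The first and third points above can be sketched in a few lines. This is a minimal Python illustration, assuming the standard-library `xml.etree.ElementTree` parser is acceptable for the input; the function names `is_day_of_month` and `name_text` are made up for this example.

```python
import xml.etree.ElementTree as ET

def is_day_of_month(token: str) -> bool:
    """Convert the string to an integer and compare, instead of pattern matching."""
    try:
        value = int(token)
    except ValueError:
        return False                # not an integer at all
    return 1 <= value <= 31

def name_text(xml_source: str) -> list:
    """Collect the text of every <name> element using a real XML parser."""
    root = ET.fromstring(xml_source)
    return [el.text or "" for el in root.iter("name")]
```

Note that `ET.fromstring` raises `ParseError` on malformed input, which is usually the behaviour you want from validation.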

Andrew Moore answered Oct 01 '22 12:10


All of those are reasonably possible in Perl.

To match a 1..31 with a regex pattern:

/( [0-9]+ ) (?(?{ $^N < 1 || $^N > 31 })(*FAIL)) /x

To generate something like [1-9]|[12]\d|3[01]:

use Regexp::Assemble qw( );
my $ra = Regexp::Assemble->new();
$ra->add($_) for (1..31);
my $re = $ra->re;                 # qr/(?:[456789]|3[01]?|1\d?|2\d?)/

Perl 5.10+ uses tries to optimise alternations, so the following should be sufficient:

my $re = join '|', 1..31;
$re = qr/$re/;
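
The join-based approach carries over to other engines. Below is a minimal Python sketch; the lookarounds and the names `DAY` and `find_days` are additions for this example (they stop a match from starting or ending inside a longer number), and the sketch does not rely on any engine-side optimisation of the alternation.

```python
import re

# Build 1|2|...|31 and guard both ends so that "31" is not matched inside "315".
alternatives = "|".join(str(n) for n in range(1, 32))
DAY = re.compile(r"(?<![0-9])(?:" + alternatives + r")(?![0-9])")

def find_days(text: str) -> list:
    """Return every standalone number from 1 to 31 found in text."""
    return [int(m.group()) for m in DAY.finditer(text)]
```

Without the trailing lookahead, an ordered alternation would accept the "1" in "15" as soon as the first alternative succeeded.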

To match an even number of occurrences:

/ (?: pat{2} )* /x

To match an odd number of occurrences:

/ pat (?: pat{2} )* /x
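
In the patterns above, `pat` stands for a single atom; for a multi-character string it has to be grouped so that the quantifier applies to the whole string. A minimal Python sketch of both cases follows, with the helper name `parity_pattern` being an assumption of this example.

```python
import re

def parity_pattern(pat: str, odd: bool) -> "re.Pattern":
    """Compile a pattern matching an odd or even number of repetitions of pat."""
    unit = "(?:" + pat + ")"        # group so quantifiers cover all of pat
    pairs = "(?:" + unit + "{2})*"  # zero or more pairs of pat
    return re.compile((unit if odd else "") + pairs)

EVEN_AB = parity_pattern("ab", odd=False)
ODD_AB = parity_pattern("ab", odd=True)
```

Use `fullmatch` (or anchors) with these, since the even pattern also matches the empty string at any position.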

To match everything except a given string (negative pattern match):

m{<name> (.*?) </name>}x  # Non-greedy matching; m{...} so the / in </name> needs no escaping

m{<name> ( (?: (?!</name>). )* ) </name>}x
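
Both forms work the same way in most backtracking engines. A minimal Python sketch for comparison follows; the names `LAZY`, `TEMPERED` and `sample` are illustrative only, and this is still pattern matching on a flat string, not XML parsing.

```python
import re

# Lazy quantifier: expand one character at a time until </name> can match.
LAZY = re.compile(r"<name>(.*?)</name>")

# Negative lookahead: consume a character only if </name> does not start there.
TEMPERED = re.compile(r"<name>((?:(?!</name>).)*)</name>")

sample = "<name>Ann</name><name>Bob</name>"
lazy_hits = LAZY.findall(sample)
tempered_hits = TEMPERED.findall(sample)
```

The lookahead form is the closer match for the `[^(</name>)]` idea in the question, because the negation is stated explicitly inside the repeated group.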

To get a pattern matching email addresses:

use Regexp::Common qw( Email::Address );
/$RE{Email}{Address}/
ikegami answered Oct 01 '22 10:10


This probably already exists, and has for a long time. It's called "grammars". Ever heard of yacc and lex? What is needed now is something simple. Strange as it may seem, the big strength of regexes is that they are very simple to write on the spot.

I believe that in some specialized (but large) areas, what is needed already exists. I'm thinking of XPath syntax.

Is there a broader alternative (not limited to XML, but still simple) that could cover all cases? Maybe you should take a look at Perl 6 grammars.

kriss answered Oct 01 '22 12:10