 

How do I write more maintainable regular expressions?

I have started to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects like default operators.

I DO have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.

Because regular expressions are built up piece by piece, I feel it is an absolute necessity to comment on the largest components of each expression. Despite this, even my own regular expressions have me scratching my head as though I am reading Klingon.

Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid due to maintainability issues?

Do not let this example cloud the question.

If the following expression by Michael Ash had some sort of bug in it, would you have any prospect of doing anything but throwing it away entirely?

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$ 

Per request the exact purpose can be found using Mr. Ash's link above.

Matches: 01.1.02 | 11-30-2001 | 2/29/2000

Non-matches: 02/29/01 | 13/01/2002 | 11/00/02
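To make the "decompose into simpler steps" option concrete, here is a minimal Python sketch of what I mean. It is only an illustration, not Mr. Ash's method and not strictly equivalent to his pattern: a small regex checks the shape (month, separator, day, same separator, year) and the standard library's `datetime.date` does the calendar arithmetic. The names `DATE_SHAPE` and `is_valid_date` are hypothetical, and treating two-digit years as 20xx is an assumption.

```python
import re
from datetime import date

# Shape check only: month, separator, day, the SAME separator again, year.
# The calendar rules (month lengths, leap years) are deliberately left out.
DATE_SHAPE = re.compile(
    r"""
    (?P<month>\d{1,2})      # one or two digits for the month
    (?P<sep>[/.-])          # separator: slash, dot, or dash
    (?P<day>\d{1,2})        # one or two digits for the day
    (?P=sep)                # the same separator as before
    (?P<year>\d{2}|\d{4})   # two- or four-digit year
    """,
    re.VERBOSE,
)


def is_valid_date(text: str) -> bool:
    """Return True if text looks like M/D/Y and names a real calendar date."""
    m = DATE_SHAPE.fullmatch(text)
    if m is None:
        return False
    year = int(m.group("year"))
    if len(m.group("year")) == 2:
        year += 2000  # assumption: two-digit years mean 20xx
    try:
        # date() raises ValueError for month 13, day 0, Feb 29 in a
        # non-leap year, and so on.
        date(year, int(m.group("month")), int(m.group("day")))
    except ValueError:
        return False
    return True


for sample in ["01.1.02", "11-30-2001", "2/29/2000", "02/29/01", "13/01/2002", "11/00/02"]:
    print(sample, is_valid_date(sample))
```

If a version like this had a bug, the shape regex and the calendar check could be fixed independently, which is the maintainability property I am asking about.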

asked Apr 02 '09 04:04 by ojblass




1 Answer

Use Expresso, which gives a hierarchical, English breakdown of a regex.

Or

This tip from Darren Neimke:

.NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace option and the (?#...) syntax embedded within each line of the pattern string.

This allows pseudocode-like comments to be embedded in each line and has the following effect on readability:

Imports System.Text.RegularExpressions

' Note: the literal # is written as \# because an unescaped # starts an
' end-of-line comment when IgnorePatternWhitespace is enabled.
Dim re As New Regex ( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(\#|@)     (?# Find a # or a @ symbol ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary)", _
    RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)

Here's another .NET example (requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):

static string validEmail = @"\b    # Find a word boundary
                (?<Username>       # Begin group: Username
                [a-zA-Z0-9._%+-]+  #   Characters allowed in username, 1 or more
                )                  # End group: Username
                @                  # The e-mail '@' character
                (?<Domainname>     # Begin group: Domain name
                [a-zA-Z0-9.-]+     #   Domain name(s), we include a dot so that
                                   #   mail.somewhere is also possible
                \.[a-zA-Z]{2,4}    #   Literal dot, then a top-level domain of 2 to 4 letters,
                                   #   so .info works, .telephone doesn't.
                )                  # End group: Domain name
                \b                 # Ending on a word boundary
                ";
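The same idea is not specific to .NET. As a rough sketch, here is how the e-mail pattern above might look in Python, where the `re.VERBOSE` flag plays the role of IgnorePatternWhitespace. The name `VALID_EMAIL` and the sample strings are made up for illustration, and the pattern is no more a complete e-mail validator than the original is.

```python
import re

# Python counterpart of the commented .NET pattern (a sketch, not a full
# e-mail validator). re.VERBOSE ignores unescaped whitespace in the pattern
# and treats "#" as the start of a comment running to the end of the line.
VALID_EMAIL = re.compile(
    r"""
    \b                               # Find a word boundary
    (?P<username>[a-zA-Z0-9._%+-]+)  # Username: 1 or more allowed characters
    @                                # The e-mail '@' character
    (?P<domain>
        [a-zA-Z0-9.-]+               # Domain name(s); dots allow mail.somewhere
        \.[a-zA-Z]{2,4}              # Literal dot, then a 2 to 4 letter TLD
    )
    \b                               # End on a word boundary
    """,
    re.VERBOSE,
)

match = VALID_EMAIL.search("Contact: jane.doe@mail.example.info today")
print(match.group("username"), match.group("domain"))
```

Named groups (`username`, `domain`) also help here: the calling code reads `match.group("domain")` instead of a bare group number.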

If your regex is applicable to a common problem, another option is to document it and submit it to RegExLib, where it will be rated and commented upon. Nothing beats many pairs of eyes...

Another regex tool is The Regulator.

answered Sep 19 '22 12:09 by Mitch Wheat