How do I make a portable regex?

Which features of regular expressions are standard, and which are idiosyncratic?
What should I do, and not do, if I want to use the same regex in different contexts, languages, and platforms?

asked May 16 '10 09:05 by dugres


1 Answer

There is no standard, but if maximum portability is your goal you should stick to the features supported by JavaScript regexes. All of the other major flavors support everything JS does, with only minor variations here and there. For example, some only support the POSIX character-class notation ([:alpha:]), while others use the Unicode syntax (\p{Alpha}).
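To make the character-class point concrete, here is a minimal sketch in Python, chosen only as one Perl-derived flavor. As far as I know its built-in re module accepts neither notation, so the explicit range is the spelling that travels, at the cost of being ASCII-only. The names PORTABLE_ALPHA and compiles are made up for this illustration.

```python
import re

# Explicit ranges are understood by every Perl-derived flavor (ASCII only).
PORTABLE_ALPHA = re.compile(r"[A-Za-z]+")


def compiles(pattern: str) -> bool:
    """Return True if this flavor (Python's re) accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


words = PORTABLE_ALPHA.findall("abc 123 Def")
print(words)  # ['abc', 'Def']

# Probe a flavor-specific notation before relying on it.
print(compiles(r"\p{Alpha}"))
```

Probing with a helper like compiles at startup is one way to find out early that a pattern written for another flavor will not load in this one.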

Probably the most troublesome variations are those that affect the dot (.) and the anchors (^ and $). For example, JavaScript had no DOTALL (or "single-line") mode until ES2018 added the s flag, so to match anything including a newline in older engines you have to use a hack like [\s\S]. Meanwhile, Ruby has a DOTALL mode but calls it multiline mode; what everyone else calls "multiline" (^ and $ as line anchors) is how Ruby always works.
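A small sketch of both points, using Python's re module as the example flavor (the sample text and variable names are only for illustration). The [\s\S] class needs no mode flag, and the line-anchor behavior has to be switched on explicitly in Python, whereas Ruby behaves that way all the time.

```python
import re

text = "first line\nsecond line"

# Without DOTALL the dot stops at the newline, so this finds nothing.
dot_match = re.search(r"first.+second", text)

# [\s\S] matches any character, newline included, with no mode flag.
class_match = re.search(r"first[\s\S]+second", text)

# In Python, ^ is a line anchor only under re.MULTILINE.
default_anchors = re.findall(r"^\w+", text)
line_anchors = re.findall(r"^\w+", text, re.MULTILINE)

print(dot_match)        # None
print(class_match.group() == "first line\nsecond")  # True
print(default_anchors)  # ['first']
print(line_anchors)     # ['first', 'second']
```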

Be aware, too, of exactly what the dot doesn't match (in the default mode). Traditionally that was just the linefeed (\n), but more and more flavors are adopting (or at least approximating) the Unicode guidelines concerning line separators. For example, in Java the dot doesn't match any of [\r\n\u0085\u2028\u2029], while ^ and $ treat \r\n as a single separator and won't match between the two characters.
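If the exact set matters, one option is to stop relying on the dot and spell out the excluded characters yourself. The sketch below uses Python, whose dot excludes only \n by default; NOT_LINE_BREAK is a name invented here, and the \uXXXX escape is itself not universal (PCRE, for instance, writes \x{2028}), so treat this as an illustration rather than a drop-in pattern.

```python
import re

# The characters Java's dot refuses to match, written out explicitly.
NOT_LINE_BREAK = r"[^\r\n\u0085\u2028\u2029]"

sample = "a\rb\u2028c\nd"

# Python's dot excludes only \n, so it runs through \r and \u2028.
python_dot = re.match(r".+", sample).group()

# The explicit class stops at the first separator of any kind.
explicit = re.match(NOT_LINE_BREAK + "+", sample).group()

print(repr(python_dot))  # 'a\rb\u2028c'
print(repr(explicit))    # 'a'
```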

Note that I'm only talking about Perl-derived flavors, like Python, Ruby, PHP, JavaScript, etc. It wouldn't make sense to include GNU or POSIX-based flavors like grep, awk, and MySQL; they tend to have fewer features, but features aren't what you would choose them for anyway.

I'm also not including the XML Schema flavor; it's much more limited than JavaScript, but it's a specialized application. For example, it doesn't support the anchors (^, $, \A, \Z, etc.) because matches are always anchored at both ends.
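To see what "anchored at both ends" means in practice, here is a rough approximation in Python: re.fullmatch requires the whole string to match, which is close to how an XML Schema pattern behaves, while re.search is the unanchored behavior most flavors default to. The pattern and inputs are made up for the example.

```python
import re

pattern = r"\d{3}-\d{4}"

# search() finds the pattern anywhere in the input.
found_anywhere = re.search(pattern, "call 555-1234 now") is not None

# fullmatch() must consume the entire input, as if wrapped in ^...$.
whole_with_noise = re.fullmatch(pattern, "call 555-1234 now") is not None
whole_exact = re.fullmatch(pattern, "555-1234") is not None

print(found_anywhere)    # True
print(whole_with_noise)  # False
print(whole_exact)       # True
```

So a regex that is portable between XML Schema and the other flavors has to be written without anchors and then applied with a whole-string match on the non-XML side.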

answered Oct 10 '22 00:10 by Alan Moore