
Why does the order of alternatives matter in regex?

Tags:

c#

.net

regex

Code

using System;
using System.Text.RegularExpressions;

namespace RegexNoMatch {
    class Program {
        static void Main () {
            string input = "a foobar& b";
            string regex1 = "(foobar|foo)&?";
            string regex2 = "(foo|foobar)&?";
            string replace = "$1";
            Console.WriteLine(Regex.Replace(input, regex1, replace));
            Console.WriteLine(Regex.Replace(input, regex2, replace));
            Console.ReadKey();
        }
    }
}

Expected output

a foobar b
a foobar b

Actual output

a foobar b
a foobar& b

Question

Why does the replacement stop working when the order of "foo" and "foobar" in the regex pattern is swapped? How can I fix this?

Athari asked Aug 02 '13 13:08



1 Answer

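The regular expression engine tries the alternatives in the order in which they are specified, from left to right, and stops at the first one that lets the overall match succeed. So when the pattern is (foo|foobar)&? it matches foo, and then &? matches the empty string because the next character is b, not &. The overall match has succeeded, so the engine never goes back to try foobar. The search then resumes with the rest of the input, bar& b, which contains no further match, so the & is left in place.

In other words, because foo is a prefix of foobar and everything after the group is optional, (foo|foobar)&? will never match foobar: foo always wins first. To fix it, list the longer alternative first, as in (foobar|foo)&?, or write the group as (foo(?:bar)?)&? instead.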
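As a rough sketch of the behaviour described above, the following program runs the original pattern next to three possible rewrites. The class name AlternationDemo and the helper KeepGroup1 are made up for this illustration; only Regex.Replace comes from .NET. The third variant assumes a word boundary after the name is acceptable for your input.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationDemo
{
    // Hypothetical helper: replaces every match with the text of capture group 1.
    public static string KeepGroup1(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "$1");
    }

    static void Main()
    {
        string input = "a foobar& b";

        // Shorter alternative first: "foo" matches, "&?" matches empty, "bar&" is left alone.
        Console.WriteLine(KeepGroup1(input, "(foo|foobar)&?"));     // a foobar& b

        // Option 1: put the longer alternative first.
        Console.WriteLine(KeepGroup1(input, "(foobar|foo)&?"));     // a foobar b

        // Option 2: factor out the common prefix and make the suffix optional.
        Console.WriteLine(KeepGroup1(input, "(foo(?:bar)?)&?"));    // a foobar b

        // Option 3: \b fails after "foo" (next char is 'b'), which forces the
        // engine to backtrack and try the second alternative, "foobar".
        Console.WriteLine(KeepGroup1(input, @"(foo|foobar)\b&?"));  // a foobar b
    }
}
```

The third variant shows that the second alternative is not unreachable in general: it is tried whenever something after the group fails and forces the engine to backtrack.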

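Occasionally, this left-to-right ordering can actually be a useful trick. The pattern (o|a|(\w)) tries o and a first, so group 2 only captures word characters that are neither o nor a, which lets you treat the two cases differently: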

Regex.Replace("a foobar& b", "(o|a|(\\w))", "$2") // " fbr& b" (with a leading space)
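
If the two cases need different handling rather than just dropping one of them, a MatchEvaluator can check whether group 2 took part in the match. This is only a sketch: the class CaptureTrickDemo and the method Transform are hypothetical names, and upper-casing is an arbitrary choice for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureTrickDemo
{
    // Hypothetical helper: drops 'o' and 'a', upper-cases every other word character.
    public static string Transform(string input)
    {
        return Regex.Replace(input, @"(o|a|(\w))", m =>
            // Group 2 succeeds only when the \w alternative was the one that matched.
            m.Groups[2].Success ? m.Groups[2].Value.ToUpperInvariant() : "");
    }

    static void Main()
    {
        // Brackets make the leading space in the result visible.
        Console.WriteLine("[" + Transform("a foobar& b") + "]"); // [ FBR& B]
    }
}
```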
p.s.w.g answered Sep 29 '22 17:09