Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java regex alternation operator "|" behavior seems broken

Trying to write a regex matcher for roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III"

In Java, the same pattern matches "I" for "IV" and "I" for "III". Turns out Java chooses between alternation matches left-to-right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.

Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?

A code sample for clarity:

public static void main(String[] args)
{
    Pattern p = Pattern.compile("six|sixty");
    Matcher m = p.matcher("The year was nineteen sixty five.");
    if (m.find())
    {
        System.out.println(m.group());
    }
    else
    {
        System.out.println("wtf?");
    }
}

This outputs "six"

like image 991
Craig Kovatch Avatar asked Dec 23 '10 02:12

Craig Kovatch


People also ask

Which regex is used in alternation?

POSIX Alternation Gives Different Results This is because POSIX based regular expression engines are required to always try all possible options in an alternation and pick the longest match when searching from the leftmost position in the search text.

What does '$' mean in regex?

$ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and ensure that the entire string is matched instead of just a substring.

What is \r in Java regex?

The \r metacharacter matches carriage return characters.

Is Java regex pattern thread safe?

The Regex class itself is thread safe and immutable (read-only). That is, Regex objects can be created on any thread and shared between threads; matching methods can be called from any thread and never alter any global state.


1 Answers

No, it's behaving correctly. Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.

You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).

"\\b(I|II|III|IV)\\b"

EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.

like image 199
Alan Moore Avatar answered Sep 29 '22 14:09

Alan Moore