
Java Regular Expressions equivalent to PCRE/etc. shorthand `\K`?

Perl regex and PCRE (Perl-Compatible Regular Expressions), among others, have the shorthand \K, which discards everything matched to its left from the reported match (capturing groups are kept). Java doesn't support it, so what is Java's equivalent?

rautamiekka asked Jul 23 '17 13:07



1 Answer

There is no direct equivalent. However, you can always re-write such patterns using capturing groups.

If you take a closer look at the \K operator and its limitations, you will see that any pattern using it can be rewritten with capturing groups.

See rexegg.com \K reference:

In the middle of a pattern, \K says "reset the beginning of the reported match to this point". Anything that was matched before the \K goes unreported, a bit like in a lookbehind.

The key difference between \K and a lookbehind is that in PCRE, a lookbehind does not allow you to use quantifiers: the length of what you look for must be fixed. On the other hand, \K can be dropped anywhere in a pattern, so you are free to have any quantifiers you like before the \K.

However, all this means that the pattern before \K is still a consuming pattern, i.e. the regex engine adds the matched text to the match value and advances its index while matching it; \K only drops that text from the reported match and keeps the index where it is. In other words, \K does nothing that a capturing group cannot do.

So, a value\s*=\s*\K\d+ PCRE/Onigmo pattern would translate into this Java code:

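import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Min value = 5000 km";
// Capture the digits in group 1 instead of resetting the match start with \K
Matcher m = Pattern.compile("value\\s*=\\s*(\\d+)").matcher(s);
if (m.find()) {
    System.out.println(m.group(1)); // prints 5000
}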
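
To collect every occurrence rather than only the first, the same capturing-group rewrite can be put in a loop. The following is a minimal sketch; the class name `KeepOutDemo`, the method `allValues`, and the second sample string are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepOutDemo {
    // Java counterpart of the PCRE pattern value\s*=\s*\K\d+ :
    // the part that followed \K goes into capturing group 1.
    private static final Pattern VALUE = Pattern.compile("value\\s*=\\s*(\\d+)");

    // Returns group 1 of every match, i.e. what PCRE would report as the whole match.
    static List<String> allValues(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = VALUE.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allValues("Min value = 5000 km, max value=7500 km"));
    }
}
```

If the offsets of the kept part are needed as well, `m.start(1)` and `m.end(1)` give the span of group 1, while `m.start()` still points at the beginning of the consumed prefix.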

There is an alternative, though it is only practical for smaller, simpler patterns: a constrained-width lookbehind.

Java accepts quantifiers within lookbehind, as long as the length of the matching strings falls within a pre-determined range. For instance, (?<=cats?) is valid because it can only match strings of three or four characters. Likewise, (?<=A{1,10}) is valid.

So, this will also work:

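    // Reuses s and m from the snippet above; each \s{0,10} keeps the lookbehind bounded
    m = Pattern.compile("(?<=value\\s{0,10}=\\s{0,10})\\d+").matcher(s);
    if (m.find()) {
        System.out.println(m.group()); // prints 5000
    }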

See the Java demo.
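
\K is also commonly used in substitutions, where only the text after it should be replaced. A sketch of how both rewrites might carry over to `replaceAll` follows. The class and method names are hypothetical, and the named group `prefix` is just one way to re-insert the part that \K would have left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepOutReplace {
    // Group-based form: the prefix is matched and then put back via ${prefix}.
    private static final Pattern WITH_GROUP =
            Pattern.compile("(?<prefix>value\\s*=\\s*)\\d+");

    // Lookbehind form: the prefix is only asserted, so the match is the digits alone.
    // Each whitespace gap is capped at 10 characters to keep the lookbehind bounded.
    private static final Pattern WITH_LOOKBEHIND =
            Pattern.compile("(?<=value\\s{0,10}=\\s{0,10})\\d+");

    static String replaceWithGroup(String input, String newValue) {
        // quoteReplacement guards against '$' or '\' inside newValue
        return WITH_GROUP.matcher(input)
                .replaceAll("${prefix}" + Matcher.quoteReplacement(newValue));
    }

    static String replaceWithLookbehind(String input, String newValue) {
        return WITH_LOOKBEHIND.matcher(input)
                .replaceAll(Matcher.quoteReplacement(newValue));
    }

    public static void main(String[] args) {
        String s = "Min value = 5000 km";
        System.out.println(replaceWithGroup(s, "42"));
        System.out.println(replaceWithLookbehind(s, "42"));
    }
}
```

Both calls in `main` turn the sample string into `Min value = 42 km`. The group-based form has no limit on the whitespace before the number, which is why it is usually the safer default.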

Wiktor Stribiżew answered Sep 23 '22 15:09