Backreferences in lookbehind

Can you use backreferences in a lookbehind?

Let's say I want to split at every position that is immediately preceded by the same character twice in a row.

    String REGEX1 = "(?<=(.)\\1)"; // DOESN'T WORK!
    String REGEX2 = "(?<=(?=(.)\\1)..)"; // WORKS!

    System.out.println(java.util.Arrays.toString(
        "Bazooka killed the poor aardvark (yummy!)"
        .split(REGEX2)
    )); // prints "[Bazoo, ka kill, ed the poo, r aa, rdvark (yumm, y!)]"

Using REGEX2 (where the backreference is in a lookahead nested inside a lookbehind) works, but REGEX1 gives this error at run-time:

    Look-behind group does not have an obvious maximum length near index 8
    (?<=(.)\1)
            ^

This sort of makes sense, I suppose, because in general a backreference can capture a string of any length. A slightly smarter regex compiler could determine that \1 refers to (.) in this case, and therefore has a finite length.
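The error is raised when the pattern is compiled, so it can be reproduced without calling split() at all. Here is a minimal sketch; the class name `LookbehindCompileCheck` and the helper `compileError` are made up for illustration, and the message for REGEX1 is assumed to match the one quoted above (it may differ between Java versions).

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookbehindCompileCheck {
    /** Returns null if the regex compiles, otherwise the compiler's description of the problem. */
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        String regex1 = "(?<=(.)\\1)";       // backreference directly inside the lookbehind
        String regex2 = "(?<=(?=(.)\\1)..)"; // backreference inside a nested lookahead

        System.out.println("REGEX1: " + compileError(regex1));
        System.out.println("REGEX2: " + compileError(regex2));
    }
}
```

A `null` result means the pattern compiled; anything else is the reason it was rejected.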

So is there a way to use a backreference in a lookbehind?

And if there isn't, can you always work around it using this nested lookahead? Are there other commonly-used techniques?

polygenelubricants asked Apr 29 '10 05:04


1 Answer

Looks like your suspicion is correct that backreferences generally can't be used in Java lookbehinds. The workaround you proposed makes the finite length of the lookbehind explicit and looks very clever to me.
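As a rough sketch of why the workaround generalizes: whenever the captured group has a known fixed length n, the nested lookahead can do the backreference check while a plain `.{2n}` gives the lookbehind its explicit length. The helper `afterRepeatedBlock` below is hypothetical, and this only covers fixed-length groups; it is not a claim that every backreference can be rewritten this way.

```java
import java.util.Arrays;

class NestedLookaheadSplit {
    /**
     * Builds a regex matching the empty position right after a block of n
     * characters that is immediately repeated. With n = 1 the result is
     * equivalent to the question's REGEX2.
     */
    static String afterRepeatedBlock(int n) {
        // The lookahead checks "block, then the same block again";
        // .{2n} then walks over those 2n characters up to the current position.
        return "(?<=(?=(.{" + n + "})\\1).{" + (2 * n) + "})";
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            "Bazooka killed the poor aardvark (yummy!)".split(afterRepeatedBlock(1))
        )); // prints "[Bazoo, ka kill, ed the poo, r aa, rdvark (yumm, y!)]"

        System.out.println(Arrays.toString(
            "ababcd".split(afterRepeatedBlock(2))
        )); // prints "[abab, cd]"
    }
}
```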

I was intrigued to find out what Python does with this regex. Python only supports fixed-length lookbehind, not finite-length like Java, but this regex is fixed length. I couldn't use re.split() directly because Python's re.split() did not split on empty matches (this changed in Python 3.7), but I think I found a bug in re.sub():

    >>> import re
    >>> r=re.compile("(?<=(.)\\1)")
    >>> a=re.sub(r,"|", "Bazooka killed the poor aardvark (yummy!)")
    >>> a
    'Bazo|oka kil|led the po|or a|ardvark (yum|my!)'

The lookbehind matches between the two duplicate characters!
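For comparison, here is a small Java sketch that marks the match positions of the question's REGEX2 in the same way, by replacing each empty match with "|". The class and method names are made up for illustration.

```java
class LookbehindPositions {
    // The working pattern from the question: the backreference lives in a
    // lookahead nested inside the lookbehind.
    static final String REGEX2 = "(?<=(?=(.)\\1)..)";

    /** Inserts "|" at every position where REGEX2 matches (all matches are empty). */
    static String mark(String input) {
        return input.replaceAll(REGEX2, "|");
    }

    public static void main(String[] args) {
        System.out.println(mark("Bazooka killed the poor aardvark (yummy!)"));
        // prints "Bazoo|ka kill|ed the poo|r aa|rdvark (yumm|y!)"
    }
}
```

Here each marker lands after the second of the two duplicate characters, which is what the question's split output implies, rather than between them.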

Tim Pietzcker answered Oct 13 '22 23:10