Backreferences in lookbehind

Q: Does SED support Lookbehind?

sed does not support lookaround assertions. For what it's worth, grep -P is also a nonstandard extension, though typically available on Linux (but not other platforms).

Q: What is a negative Lookbehind?

A negative lookbehind assertion succeeds only if the pattern inside it does not match immediately before the current position. Its syntax is (?<!...). For example, (?<!xyz)abc matches the string abc only where it is not immediately preceded by the string xyz.
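As a minimal Java sketch of that negative lookbehind (the class and method names are made up for illustration), the following collects the start index of every abc that is not directly preceded by xyz:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookbehindDemo {
    // Start indices of every "abc" that is NOT immediately preceded by "xyz".
    // The helper name is hypothetical; only Pattern and Matcher come from the JDK.
    static List<Integer> findAbcNotAfterXyz(String input) {
        Matcher m = Pattern.compile("(?<!xyz)abc").matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The first "abc" follows "xyz" and is rejected; the second one is accepted.
        System.out.println(findAbcNotAfterXyz("xyzabc abc")); // prints "[7]"
    }
}
```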

Tags: java, regex, backreference, lookbehind

Can you use backreferences in a lookbehind?

Let's say I want to split wherever behind me a character is repeated twice.

    String REGEX1 = "(?<=(.)\\1)"; // DOESN'T WORK!
    String REGEX2 = "(?<=(?=(.)\\1)..)"; // WORKS!

    System.out.println(java.util.Arrays.toString(
        "Bazooka killed the poor aardvark (yummy!)"
        .split(REGEX2)
    )); // prints "[Bazoo, ka kill, ed the poo, r aa, rdvark (yumm, y!)]"

Using REGEX2 (where the backreference is in a lookahead nested inside a lookbehind) works, but REGEX1 gives this error at run-time:

    Look-behind group does not have an obvious maximum length near index 8
    (?<=(.)\1)
            ^

This sort of makes sense, I suppose, because in general a backreference can match a string of any length (if the regex compiler were a bit smarter, though, it could determine that \1 refers to (.) in this case, and therefore has a finite length).
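To check how a particular JDK treats the two patterns, a small sketch like the following may help. It assumes nothing beyond `java.util.regex`; the class and helper names are invented here, and whether REGEX1 is rejected may depend on the Java version, so the code reports the outcome rather than assuming it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookbehindBackrefDemo {
    static final String REGEX1 = "(?<=(.)\\1)";       // backreference directly in the lookbehind
    static final String REGEX2 = "(?<=(?=(.)\\1)..)"; // backreference moved into a nested lookahead

    // Returns null if the regex compiles, otherwise the compiler's error description.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    // Splits after every pair of identical adjacent characters, using REGEX2.
    static String[] splitAfterDoubledChar(String input) {
        return input.split(REGEX2);
    }

    public static void main(String[] args) {
        System.out.println("REGEX1: " + compileError(REGEX1));
        System.out.println("REGEX2: " + compileError(REGEX2));
        System.out.println(Arrays.toString(
            splitAfterDoubledChar("Bazooka killed the poor aardvark (yummy!)")));
    }
}
```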

So is there a way to use a backreference in a lookbehind?

And if there isn't, can you always work around it using this nested lookahead? Are there other commonly-used techniques?

703

asked Apr 29 '10 05:04

polygenelubricants

1 Answer

Looks like your suspicion is correct that backreferences generally can't be used in Java lookbehinds. The workaround you proposed makes the finite length of the lookbehind explicit and looks very clever to me.
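As for other techniques, one option (an illustration added here, not something the answer itself proposes) is to avoid lookbehind entirely: find each doubled character with a plain lookahead via `Matcher.find()` and cut the string two characters later. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitWithoutLookbehind {
    // Cuts the input after every pair of identical adjacent characters.
    // The lookahead (?=(.)\1) is zero-width, so overlapping pairs are found too.
    static List<String> splitAfterDoubledChar(String input) {
        Matcher m = Pattern.compile("(?=(.)\\1)").matcher(input);
        List<String> pieces = new ArrayList<>();
        int prev = 0;
        while (m.find()) {
            // The pair starts at m.start() and spans two copies of group 1.
            int cut = m.start() + 2 * m.group(1).length();
            pieces.add(input.substring(prev, cut));
            prev = cut;
        }
        if (prev < input.length()) {
            pieces.add(input.substring(prev)); // remaining tail, if any
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(splitAfterDoubledChar("Bazooka killed the poor aardvark (yummy!)"));
    }
}
```

For the sample sentence this should produce the same pieces as splitting on REGEX2. The loop makes the cut positions explicit, at the cost of a few more lines than a single `split` call.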

I was intrigued to find out what Python does with this regex. Python only supports fixed-length lookbehind, not finite-length like Java, but this regex is fixed length. I couldn't use re.split() directly because Python's re.split() did not split on an empty match at the time (it does since Python 3.7), but I think I found a bug in re.sub():

    >>> r=re.compile("(?<=(.)\\1)")
    >>> a=re.sub(r,"|", "Bazooka killed the poor aardvark (yummy!)")
    >>> a
    'Bazo|oka kil|led the po|or a|ardvark (yum|my!)'

The lookbehind matches between the two duplicate characters!
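For comparison, the same marking experiment can be sketched in Java with the nested-lookahead pattern from the question. The class and method names are illustrative only. Here the marker lands after each doubled pair rather than inside it:

```java
class MarkSplitPoints {
    // Inserts "|" at every position that is preceded by two identical characters.
    static String mark(String input) {
        return input.replaceAll("(?<=(?=(.)\\1)..)", "|");
    }

    public static void main(String[] args) {
        System.out.println(mark("Bazooka killed the poor aardvark (yummy!)"));
        // prints "Bazoo|ka kill|ed the poo|r aa|rdvark (yumm|y!)"
    }
}
```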

145

answered Oct 13 '22 23:10

Tim Pietzcker
