
Regex backreferences in Java

Tags: java, regex

I needed to match a digit followed by itself 14 times, so I came up with the following regular expression in the regexstor.net/tester:

(\d)\1{14}

When I pasted it into my code, I escaped the backslashes properly:

"(\\d)\\1{14}"

I then replaced the back-reference "\1" with "$1", which is what Java uses to refer to groups in replacement strings.

Then I realized that it doesn't work. In Java, when you need to back-reference a group inside the regex itself, you have to use "\N", but when you want to refer to it in a replacement, the syntax is "$N".

My question is: why?

asked Jun 09 '16 at 19:06 by Jaumzera



2 Answers

$1 is not a back reference in Java's regexes, nor in any other flavor I can think of. You only use $1 when you are replacing something:

String input="A12.3 bla bla my input";
input = StringUtils.replacePattern(
            input, "^([A-Z]\\d{2}\\.\\d).*$", "$1");
//                                            ^^^^

There is some misinformation about what a back reference is, including the very place I got that snippet from: the question "simple java regex with backreference does not work".


Java modeled its regex syntax after other existing flavors where the $ was already a meta character. It anchors to the end of the string (or line in multi-line mode).

Similarly, Java uses \1 for back references. Because the regex is written as a Java string literal, the backslash itself must be escaped: \\1.
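The split between the two notations can be sketched with the standard java.util.regex classes. This is only an illustration: the class and method names below are made up for the example, and String.repeat assumes Java 11 or later.

```java
import java.util.regex.Pattern;

class BackrefVsReplacement {

    // Inside the pattern, \1 (written "\\1" in a Java literal) refers back to group 1.
    static final Pattern FIFTEEN_SAME = Pattern.compile("(\\d)\\1{14}");

    // True only if the whole string is one digit followed by itself 14 times.
    static boolean isRun(String s) {
        return FIFTEEN_SAME.matcher(s).matches();
    }

    // In the replacement string, $1 stands for whatever group 1 captured.
    static String collapseRun(String s) {
        return FIFTEEN_SAME.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String fives = "5".repeat(15);                         // hypothetical test input
        System.out.println(isRun(fives));                      // true
        System.out.println(isRun("5".repeat(14) + "6"));       // false
        System.out.println(collapseRun("a" + fives + "b"));    // a5b
    }
}
```

The same group is addressed as \1 while the engine is still matching and as $1 once the match is finished and a replacement is being built.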

From a lexical/syntactic standpoint it is true that $1 could be used unambiguously (as a bonus it would prevent the need for the "evil escaped escape" when using back references).

To match a 1 that comes right after the end of a line (in multi-line mode), the regex would need to be $\n1:

this line
1
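A small sketch of how Java treats $ inside a pattern, using only java.util.regex. The class and method names are invented for the illustration, and String.repeat assumes Java 11 or later.

```java
import java.util.regex.Pattern;

class DollarInPattern {

    // "$" followed by a newline and a literal 1. MULTILINE lets "$" match
    // just before a line break instead of only at the end of the input.
    static boolean oneAfterLineEnd(String s) {
        return Pattern.compile("$\\n1", Pattern.MULTILINE).matcher(s).find();
    }

    // The same pattern without MULTILINE, for comparison.
    static boolean oneAfterLineEndNoFlag(String s) {
        return Pattern.compile("$\\n1").matcher(s).find();
    }

    // "$1" inside a pattern is not a back reference: it is the end anchor
    // followed by the literal character 1 (here repeated 14 times).
    static boolean dollarOneInPattern(String s) {
        return Pattern.compile("(\\d)$1{14}").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(oneAfterLineEnd("this line\n1"));        // true
        System.out.println(oneAfterLineEndNoFlag("this line\n1"));  // false
        System.out.println(dollarOneInPattern("5".repeat(15)));     // false
    }
}
```

The last call shows why swapping \1 for $1 in the question's pattern compiles without error but never matches: nothing can follow the end of the input.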

It just makes more sense to use a familiar syntax instead of changing the rules, most of which came from Perl.

The first version of Perl came out in 1987, well before Java, which was released in beta in 1995.

I dug up the man pages for Perl 1, which say:

The bracketing construct (\ ...\ ) may also be used, in which case \<digit> matches the digit'th substring. (Outside of the pattern, always use $ instead of \ in front of the digit. The scope of $<digit> (and $\`, $& and $') extends to the end of the enclosing BLOCK or eval string, or to the next pattern match with subexpressions. The \<digit> notation sometimes works outside the current pattern, but should not be relied upon.) You may have as many parentheses as you wish. If you have more than 9 substrings, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parens before the backreference. Otherwise (for backward compatibilty) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)

answered Nov 03 '22 at 17:11 by Laurel


I think the main problem is not the backreference, which works perfectly fine with \1 in Java.

Your problem is more likely the overall escaping of a regex pattern in Java.

If you want to have the pattern

(\d)\1{14}

passed to the regex engine, you first need to escape its backslashes, because you write it as a Java string literal:

(\\d)\\1{14}

Voilà, works like a charm: goo.gl/BNCx7B (prepend http://; SO does not allow URL shorteners, but tutorialspoint.com seems to offer no other option)
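To check what the regex engine actually receives, one option is simply to print the string literal. The sketch below uses a hypothetical class name and no libraries beyond the JDK:

```java
class EscapingDemo {

    // The Java source contains doubled backslashes; the compiler turns each
    // "\\" into a single backslash character in the resulting String.
    static String patternText() {
        return "(\\d)\\1{14}";
    }

    public static void main(String[] args) {
        System.out.println(patternText());          // (\d)\1{14}
        System.out.println(patternText().length()); // 10
    }
}
```

The printed text is the pattern exactly as it was typed into the online tester, which is what Pattern.compile then parses.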

Offline example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HelloWorld {

    public static void main(String[] args) {
        // Fifteen 5s: one digit followed by itself 14 times.
        String test = "555555555555555";

        // Java string literal for the regex (\d)\1{14}
        String pattern = "(\\d)\\1{14}";

        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(test);
        if (m.find()) {
            System.out.println("Matched!");
        } else {
            System.out.println("not matched :-(");
        }
    }
}

answered Nov 03 '22 at 17:11 by dognose