
Java regex : How to reuse a consumed character in pattern matching?

Tags: java, regex

Is there a way to reuse a consumed character of the source in pattern matching?

For example, suppose I want to find matches of the regular expression (a+b+|b+a+), i.e. one or more a followed by one or more b, OR vice versa.

Suppose the input is aaaabbbaaaaab

Then the output using regex would be aaaabbb and aaaaab

How can I get the output to be

aaaabbb
bbbaaaaa
aaaaab
dshgna asked Mar 31 '13 08:03




2 Answers

Try it this way:

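import java.util.regex.Matcher;
import java.util.regex.Pattern;

String data = "aaaabbbaaaaab";
// The lookahead captures the whole run in group 1; the second group only decides where a match may start.
Matcher m = Pattern.compile("(?=(a+b+|b+a+))(^|(?<=a)b|(?<=b)a)").matcher(data);
while (m.find())
    System.out.println(m.group(1));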

This regex uses lookaround and will find every (a+b+|b+a+) that

  • starts at the beginning ^ of the input
  • starts with a b that is preceded by a
  • starts with an a that is preceded by b.

Output:

aaaabbb
bbbaaaaa
aaaaab

Is ^ really needed in this regular expression?

Yes, without ^ this regex wouldn't capture the aaaabbb at the start of the input.

If I didn't add (^|(?<=a)b|(?<=b)a) after (?=(a+b+|b+a+)), the regex would match

aaaabbb
aaabbb
aabbb
abbb
bbbaaaaa
bbaaaaa
baaaaa
aaaaab
aaaab
aaab
aab
ab

so I needed to limit the results to only those that start with an a that has a b before it (without including that b in the match, so lookbehind was perfect for that), or with a b that is preceded by an a.

But let's not forget about an a or b at the start of the string, which is not preceded by anything. To include those we can use ^.
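To make the effect of the second group concrete, here is a minimal sketch that runs the bare lookahead and the filtered pattern on the same input. The class name OverlapDemo and the helper groupOne are made up for this illustration; only Pattern, Matcher and the two regexes come from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {
    // Hypothetical helper: collects capture group 1 for every match of regex in input.
    static List<String> groupOne(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String data = "aaaabbbaaaaab";
        // Lookahead only: a match is reported at every position where a run can start.
        System.out.println(groupOne("(?=(a+b+|b+a+))", data));
        // With the start-position filter: prints [aaaabbb, bbbaaaaa, aaaaab]
        System.out.println(groupOne("(?=(a+b+|b+a+))(^|(?<=a)b|(?<=b)a)", data));
    }
}
```

The first call returns the twelve overlapping candidates listed above; the second keeps only the three that begin at the start of the input or right after a change of letter.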


Maybe the idea is easier to see with this regex:

(?=(a+b+|b+a+))((?<=^|a)b|(?<=^|b)a)

  • (?<=^|a)b will match b that is placed at start of string, or has a before it
  • (?<=^|b)a will match a that is placed at start of string, or has b before it
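As a quick check of this variant, the sketch below runs it against the question's input. The class name AnchoredLookbehindDemo and the method find are assumptions for the example; the regex is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchoredLookbehindDemo {
    // The start-of-input case is folded into each lookbehind instead of a separate ^ alternative.
    static final Pattern P = Pattern.compile("(?=(a+b+|b+a+))((?<=^|a)b|(?<=^|b)a)");

    // Hypothetical helper: returns group 1 (the full run seen by the lookahead) for each match.
    static List<String> find(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [aaaabbb, bbbaaaaa, aaaaab]
        System.out.println(find("aaaabbbaaaaab"));
    }
}
```

Unlike the first pattern, each match here consumes one character (the leading a or b), so the matcher never has to step over an empty match.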
Pshemo answered Oct 21 '22 21:10


You can simulate this with lookbehind:

((?<=a)b+|(?<=b)a+)

This outputs

bbb aaaaa b
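For comparison, here is a small sketch that runs this lookbehind-only pattern and records where each match starts. LookbehindRunsDemo and matchesWithOffsets are illustrative names, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindRunsDemo {
    // Hypothetical helper: returns each match as "start:text" so the offsets are visible.
    static List<String> matchesWithOffsets(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("((?<=a)b+|(?<=b)a+)").matcher(input);
        while (m.find()) {
            out.add(m.start() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [4:bbb, 7:aaaaa, 12:b]
        System.out.println(matchesWithOffsets("aaaabbbaaaaab"));
    }
}
```

The offsets show that each match is only the run that follows a change of letter. The preceding run is checked by the lookbehind but is not part of the match, and the leading aaaa is never matched at all.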
nneonneo answered Oct 21 '22 21:10