
How to find the last occurrence of a set of characters in a string using regex in Java?

I need to find the last index at which any character from a given set occurs in a string. Suppose the set of characters is x, y, z and the string is Vereador Luiz Pauly Home; then I need the index 18.

To find the index, I created a pattern with the DOTALL flag and a greedy quantifier: (?s).*(x|y|z). When the pattern is applied to the string (which can be multiline), I read the index from the start of group 1. The code:

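import java.util.regex.Matcher;
import java.util.regex.Pattern;

int findIndex(String str){
  int index = -1;
  // (?s) lets . match line terminators; the greedy .* pushes group 1 to the last x, y or z
  Pattern p = Pattern.compile("(?s).*(x|y|z)");
  Matcher m = p.matcher(str);
  if(m.find()){
    index = m.start(1); // start offset of group 1
  }
  return index;
}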

As expected, it returns the correct value when there is a match.

But when there is no match, it takes far too long (17 minutes for 600000 characters) because of the greedy match.

I tried other quantifiers but couldn't get the desired output. Can anyone suggest a better regex?

PS: I could also traverse the content from the end and find the index that way, but I hope there is a better way in regex that does the job quickly.

darklearner07 asked Jun 04 '19 09:06




2 Answers

There are a few ways to solve the problem, and the best one depends on the size of the input and the complexity of the pattern:

  1. Reverse the input string and possibly the pattern. This might work for non-complex patterns. Unfortunately, java.util.regex doesn't allow matching the pattern from right to left.

  2. Instead of using a greedy quantifier, simply match the pattern and loop over Matcher.find() until the last occurrence is found.

  3. Use a different regex engine with better performance, e.g. RE2/J: linear time regular expression matching in Java.
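
Option 2 could look roughly like the sketch below. It assumes the character set x, y, z from the question; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastMatchByLoop {
    // Character set taken from the question; compiled once and reused.
    private static final Pattern XYZ = Pattern.compile("[xyz]");

    // Hypothetical helper: returns the index of the last x, y or z, or -1 if none.
    public static int findLastIndex(String str) {
        int index = -1;
        Matcher m = XYZ.matcher(str);
        // Each find() resumes where the previous match ended,
        // so the input is walked once from left to right.
        while (m.find()) {
            index = m.start();
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(findLastIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(findLastIndex("no match here"));            // -1
    }
}
```

Because the pattern has no quantifier, there is nothing to backtrack into, and the no-match case costs a single pass over the input.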

If option 2 is not efficient enough for your case, I'd suggest trying RE2/J. Its documentation describes the difference:

Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.

If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.

In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.

Karol Dowbecki answered Nov 07 '22 16:11


Performance issues with the (?s).*(x|y|z) regex come from the fact that .* is the first subpattern: it grabs the whole string first, and then backtracking occurs one character at a time to find x, y or z. If there is no match, or the match is at the start of the string, and the string is very large, this might take a really long time.

The ([xyz])(?=[^xyz]*$) pattern seems a little bit better: it captures x, y or z and asserts there is no other x, y or z up to the end of the string. It is still somewhat resource-consuming, because the lookahead is run again for every x, y or z that is found.
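
For comparison, here is a minimal sketch of how the lookahead pattern could be used from Java. The class and method names are hypothetical; only the pattern itself comes from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastMatchByLookahead {
    // Group 1 is x, y or z; the lookahead requires that no further
    // x, y or z occurs between the match and the end of the input.
    private static final Pattern LAST_XYZ = Pattern.compile("([xyz])(?=[^xyz]*$)");

    // Hypothetical helper: returns the index of the last x, y or z, or -1 if none.
    public static int findLastIndex(String str) {
        Matcher m = LAST_XYZ.matcher(str);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(findLastIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(findLastIndex("no match here"));            // -1
    }
}
```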

The fastest regex to get your job done is

^(?:[^xyz]*+([xyz]))+

It matches

  • ^ - start of string
  • (?:[^xyz]*+([xyz]))+ - 1 or more repetitions of
    • [^xyz]*+ - any 0 or more chars other than x, y and z matched possessively (no backtracking into the pattern is allowed)
    • ([xyz]) - Group 1: x, y or z.

Group 1 keeps the value and position captured in the last iteration of the repeated group, since each iteration overwrites what the previous one captured.
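
A minimal sketch of plugging this pattern into the method from the question, assuming the same x, y, z set. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastMatchPossessive {
    // [^xyz]*+ is possessive: once it has consumed a run of other characters,
    // the engine never backtracks into that run.
    private static final Pattern LAST_XYZ = Pattern.compile("^(?:[^xyz]*+([xyz]))+");

    // Hypothetical helper: returns the index of the last x, y or z, or -1 if none.
    public static int findLastIndex(String str) {
        Matcher m = LAST_XYZ.matcher(str);
        // After a successful match, group 1 holds the capture
        // from the final iteration of the repeated group.
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(findLastIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(findLastIndex("no match here"));            // -1
    }
}
```

Since [^xyz] is a negated character class, it also matches line terminators, so the DOTALL flag is not needed for multiline input.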

Wiktor Stribiżew answered Nov 07 '22 18:11