How to find the last occurrence of a set of characters in a string using regex in Java?

Tags: java, regex, regex-greedy

I need to find the last index of any character from a set in a string. Say the set of characters is x, y, z and the string is Vereador Luiz Pauly Home; then I need the index 18.

To find the index, I created a pattern with the DOTALL flag and a greedy quantifier: (?s).*(x|y|z). When the pattern is applied to that string (which may be multiline), I can read the index from the start of group 1. The code:

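import java.util.regex.Matcher;
import java.util.regex.Pattern;

int findIndex(String str){
  int index = -1;
  // DOTALL so that . also matches line terminators; .* is greedy
  Pattern p = Pattern.compile("(?s).*(x|y|z)");
  Matcher m = p.matcher(str);
  if(m.find()){
    // start of group 1 is the index of the last x, y or z
    index = m.start(1);
  }
  return index;
}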

As expected, it returns the correct value when there is a match.

But when there is no match, it takes far too long (17 minutes for 600000 characters) because of the greedy match.

I tried other quantifiers but couldn't get the desired output. Can anyone suggest a better regex?

PS: I could also traverse the content from the end to find the index. But I hope there is a better way in regex that can do the job quickly.


asked Jun 04 '19 09:06

darklearner07

2 Answers

There are a few ways to solve the problem, and the best one will depend on the size of the input and the complexity of the pattern:

1. Reverse the input string and possibly the pattern; this might work for non-complex patterns. Unfortunately, java.util.regex doesn't allow matching the pattern from right to left.
2. Instead of using a greedy quantifier, simply match the pattern and loop Matcher.find() until the last occurrence is found.
3. Use a different regex engine with better performance, e.g. RE2/J: linear time regular expression matching in Java.
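A rough sketch of the Matcher.find() loop could look like the following. The class name LastMatchByLoop and the pattern [xyz] are only illustrative assumptions based on the characters in the question; a real pattern would go in its place.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class LastMatchByLoop {
    // No greedy prefix: the pattern matches a single target character.
    private static final Pattern TARGET = Pattern.compile("[xyz]");

    /** Returns the index of the last x, y or z in str, or -1 if there is none. */
    static int findIndex(String str) {
        int index = -1;
        Matcher m = TARGET.matcher(str);
        // Each find() resumes after the previous match, so the input is
        // walked once from left to right; the last hit wins.
        while (m.find()) {
            index = m.start();
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(findIndex("no match here"));            // -1
    }
}
```

Since the pattern has no leading .*, a string without any match is scanned once and the method simply returns -1.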

If option 2 is not efficient enough for your case, I'd suggest trying RE2/J:

Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.

If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.

In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.


answered Nov 07 '22 16:11

Karol Dowbecki

Performance issues with the (?s).*(x|y|z) regex come from the fact that .* is the first subpattern: it grabs the whole string first, and then backtracking occurs to find x, y or z. If there is no match, or the match is at the start of the string, and the string is very large, this might take a really long time.

The ([xyz])(?=[^xyz]*$) pattern seems a little better: it captures x, y or z and asserts there is no other x, y or z up to the end of the string. It is still somewhat resource-consuming, though, because the lookahead has to be checked after each x, y or z that is found.
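For reference, the lookahead pattern could be wired into the asker's method roughly like this. The class name LastMatchLookahead is made up for the example; the regex is the one above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class LastMatchLookahead {
    // One of x, y, z, followed only by characters other than x, y, z
    // up to the end of the input.
    private static final Pattern LAST_XYZ = Pattern.compile("([xyz])(?=[^xyz]*$)");

    /** Returns the index of the last x, y or z in str, or -1 if there is none. */
    static int findIndex(String str) {
        Matcher m = LAST_XYZ.matcher(str);
        // The first (and only possible) match is the last occurrence.
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(findIndex("abc"));                      // -1
    }
}
```

No DOTALL flag is needed here because a negated character class such as [^xyz] already matches line terminators.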

The fastest regex to get your job done is

^(?:[^xyz]*+([xyz]))+

It matches

^ - start of string
(?:[^xyz]*+([xyz]))+ - 1 or more repetitions of
- [^xyz]*+ - any 0 or more chars other than x, y and z matched possessively (no backtracking into the pattern is allowed)
- ([xyz]) - Group 1: x, y or z.

The Group 1 value and data will belong to the last iteration of the repeated group (as all the preceding data is re-written with each subsequent iteration).
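As a minimal sketch, the pattern can replace the original one in the asker's method like so. The class name LastMatchPossessive is an assumption made for the example; the regex and the use of group 1 follow the explanation above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class LastMatchPossessive {
    // Anchored at the start; each iteration skips non-x/y/z characters
    // possessively and then captures one of x, y, z into group 1.
    private static final Pattern LAST_XYZ = Pattern.compile("^(?:[^xyz]*+([xyz]))+");

    /** Returns the index of the last x, y or z in str, or -1 if there is none. */
    static int findIndex(String str) {
        Matcher m = LAST_XYZ.matcher(str);
        // Group 1 holds the capture from the last successful iteration.
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        System.out.println(findIndex("Vereador Luiz Pauly Home")); // 18
        System.out.println(findIndex("no match here"));            // -1
    }
}
```

When there is no x, y or z at all, [^xyz]*+ consumes the whole input once and cannot give anything back, and the ^ anchor makes every later start position fail immediately, so the no-match case should no longer blow up.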

answered Nov 07 '22 18:11

Wiktor Stribiżew
