Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding longest regex match in Java?

Tags:

java

regex

I have this:

import java.util.regex.*;

String regex = "(?<m1>(hello|universe))|(?<m2>(hello world))";
String s = "hello world";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while(matcher.find()) {
  MatchResult matchResult = m.toMatchResult();
  String substring = s.substring(matchResult.start(), matchResult.end());
  System.out.println(substring);
}

The above only prints hello whereas I want it to print hello world.

One way to fix this is to re-order the groups in String regex = "(?<m2>(hello world))|(?<m1>(hello|universe))" but I don't have control over the regex I get in my case...

So what is the best way to find the longest match? An obvious way would be to check all possible substrings of s as mentioned here (Efficiently finding all overlapping matches for a regular expression) by length and pick the first but that is O(n^2). Can we do better?

like image 388
pathikrit Avatar asked Feb 28 '17 00:02

pathikrit


3 Answers

Here is a way of doing it using matcher regions, but with a single loop over the string index:

public static String findLongestMatch(String regex, String s) {
    Pattern pattern = Pattern.compile("(" + regex + ")$");
    Matcher matcher = pattern.matcher(s);
    String longest = null;
    int longestLength = -1;
    for (int i = s.length(); i > longestLength; i--) {
        matcher.region(0, i);
        if (matcher.find() && longestLength < matcher.end() - matcher.start()) {
            longest = matcher.group();
            longestLength = longest.length();
        }
    }
    return longest;
}

I'm forcing the pattern to match until the region's end, and then I move the region's end from the rightmost string index towards the left. For each region's end tried, Java will match the leftmost starting substring that finishes at that region's end, i.e. the longest substring that ends at that place. Finally, it's just a matter of keeping track of the longest match found so far.

As a matter of optimization, and since I start from the longer regions towards the shorter ones, I stop the loop as soon as all regions that would come after are already shorter than the length of longest substring already found.


An advantage of this approach is that it can deal with arbitrary regular expressions and no specific pattern structure is required:

findLongestMatch("(?<m1>(hello|universe))|(?<m2>(hello world))", "hello world")
==> "hello world"

findLongestMatch("hello( universe)?", "hello world")
==> "hello"

findLongestMatch("hello( world)?", "hello world")
==> "hello world"

findLongestMatch("\\w+|\\d+", "12345 abc")
==> "12345"
like image 90
Helder Pereira Avatar answered Nov 20 '22 02:11

Helder Pereira


If you are dealing with just this specific pattern:

  1. There is one or more named group on the highest level connected by |.
  2. The regex for the group is put in superfluous braces.
  3. Inside those braces is one or more literal connected by |.
  4. Literals never contain |, ( or ).

Then it is possible to write a solution by extracting the literals, sorting them by their length and then returning the first match:

private static final Pattern g = Pattern.compile("\\(\\?\\<[^>]+\\>\\(([^)]+)\\)\\)");

public static final String findLongestMatch(String s, Pattern p) {
    Matcher m = g.matcher(p.pattern());
    List<String> literals = new ArrayList<>();
    while (m.find())
        Collections.addAll(literals, m.group(1).split("\\|"));
    Collections.sort(literals, new Comparator<String>() {
        public int compare(String a, String b) {
            return Integer.compare(b.length(), a.length());
        }
    });
    for (Iterator<String> itr = literals.iterator(); itr.hasNext();) {
         String literal = itr.next();
         if (s.indexOf(literal) >= 0)
              return literal;
    }
    return null;
}

Test:

System.out.println(findLongestMatch(
    "hello world",
    Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: hello world
System.out.println(findLongestMatch(
    "hello universe",
    Pattern.compile("(?<m1>(hello|universe))|(?<m2>(hello world))")
));
// output: universe
like image 31
maraca Avatar answered Nov 20 '22 03:11

maraca


just add the $ (End of string) before the Or separator |.
Then it check whether the string is ended of not. If ended, it will return the string. Otherwise skip that part of regex.

The below code gives what you want

import java.util.regex.*;
public class RegTest{
  public static void main(String[] arg){
        String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))";
        String s = "hello world";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(s);
        while(matcher.find()) {
            MatchResult matchResult = matcher.toMatchResult();
            String substring = s.substring(matchResult.start(), matchResult.end());
            System.out.println(substring);
        }
    }
}

Likewise, the below code will skip hello , hello world and match hello world there
See the usage of $ there

import java.util.regex.*;
public class RegTest{
  public static void main(String[] arg){
        String regex = "(?<m1>(hello|universe))$|(?<m2>(hello world))$|(?<m3>(hello world there))";
        String s = "hello world there";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(s);
        while(matcher.find()) {
            MatchResult matchResult = matcher.toMatchResult();
            String substring = s.substring(matchResult.start(), matchResult.end());
            System.out.println(substring);
        }
    }
}
like image 1
Sagar V Avatar answered Nov 20 '22 03:11

Sagar V