All overlapping substrings matching a Java regex

Is there an API method that returns all (possibly overlapping) substrings that match a regular expression?

For example, I have a text string, String t = "04/31 412-555-1235";, and a pattern, Pattern p = Pattern.compile("\\d\\d+");, that matches runs of two or more digits.

The matches I get are: 04, 31, 412, 555, 1235.

How do I get overlapping matches?

I want the code to return: 04, 31, 41, 412, 12, 55, 555, 55, 12, 123, 1235, 23, 235, 35.

Theoretically it should be possible -- there is an obvious O(n^2) algorithm that enumerates and checks all the substrings against the pattern.
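
As a rough illustration of that brute-force idea, here is a minimal sketch. The class and method names are hypothetical, and it checks each extracted substring in isolation, so it assumes the pattern does not depend on surrounding context such as word boundaries or lookarounds. It enumerates O(n^2) substrings, and each check costs additional time on top of that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SubstringEnumeration {
    // Hypothetical helper: returns every non-empty substring of text
    // that the regex matches in full, ordered by start then end index.
    static List<String> allMatchingSubstrings(String text, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                String candidate = text.substring(i, j);
                // matches() requires the whole candidate to match,
                // not just a part of it.
                if (pattern.matcher(candidate).matches()) {
                    matches.add(candidate);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(allMatchingSubstrings("04/31 412-555-1235", "\\d\\d+"));
    }
}
```

For the example text and pattern from this question, the sketch yields the fourteen substrings listed above, in the same order.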

EDIT

Rather than enumerating all substrings, it is safer to use the region(int start, int end) method in Matcher. Checking the pattern against a separate, extracted substring might change the result of the match (e.g. if there is a non-capturing group or word boundary check at the start/end of the pattern).

EDIT 2

Actually, it's unclear whether region() does what you expect for zero-width matches. The specification is vague, and experiments yield disappointing results.

For example:

String line = "xx90xx";
String pat = "\\b90\\b";
System.out.println(Pattern.compile(pat).matcher(line).find()); // prints false
for (int i = 0; i < line.length(); ++i) {
  for (int j = i + 1; j <= line.length(); ++j) {
    // Restrict the search to [i, j). By default the region bounds are opaque,
    // so \b cannot see the characters outside the region.
    Matcher m = Pattern.compile(pat).matcher(line).region(i, j);
    if (m.find() && m.group().length() == (j - i)) {
      System.out.println(m.group() + " (" + i + ", " + j + ")"); // prints 90 (2, 4)
    }
  }
}

I'm not sure what the most elegant solution is. One approach would be to take a substring of line and pad it with the appropriate boundary characters before checking whether pat matches.
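
Another option that may be worth trying is to change the bounds of the region rather than pad the text. The sketch below repeats the region experiment above with Matcher.useTransparentBounds(true), which the Javadoc describes as letting lookahead, lookbehind and boundary constructs see past the region, and with Matcher.useAnchoringBounds(false), so that ^ and $ refer to the real ends of the input. The class and method names are hypothetical, and I have only reasoned through the two inputs shown in the comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionEnumeration {
    // Hypothetical helper: lists every region [i, j) of text that the
    // regex matches in full, while still seeing the surrounding text.
    static List<String> allMatchesInContext(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        // Let \b and lookarounds look at characters outside the region.
        m.useTransparentBounds(true);
        // Make ^ and $ match only at the real ends of the input.
        m.useAnchoringBounds(false);
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                // region() resets the matcher but keeps the bounds settings.
                m.region(i, j);
                // matches() requires the match to span the whole region.
                if (m.matches()) {
                    found.add(m.group() + " [" + i + ", " + j + ")");
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // "xx90xx": no word boundary around 90, so nothing is reported.
        System.out.println(allMatchesInContext("xx90xx", "\\b90\\b"));
        // "a 90 b": 90 is delimited by spaces, so [2, 4) is reported.
        System.out.println(allMatchesInContext("a 90 b", "\\b90\\b"));
    }
}
```

Reusing a single Matcher and only moving its region also avoids recompiling the pattern for every pair of indices.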

EDIT 3

Here is the full solution that I came up with. It can handle zero-width patterns, boundaries, etc. in the original regular expression. It looks through all substrings of the text string and checks whether the regular expression matches only at the specific position by padding the pattern with the appropriate number of wildcards at the beginning and end. It seems to work for the cases I tried -- although I haven't done extensive testing. It is most certainly less efficient than it could be.

  public static void allMatches(String text, String regex)
  {
    for (int i = 0; i < text.length(); ++i) {
      for (int j = i + 1; j <= text.length(); ++j) {
        String positionSpecificPattern = "((?<=^.{"+i+"})("+regex+")(?=.{"+(text.length() - j)+"}$))";
        Matcher m = Pattern.compile(positionSpecificPattern).matcher(text);

        if (m.find()) 
        {   
          System.out.println("Match found: \"" + (m.group()) + "\" at position [" + i + ", " + j + ")");
        }   
      }   
    }   
  }

EDIT 4

Here's a better way of doing this: https://stackoverflow.com/a/11372670/244526

EDIT 5

The JRegex library supports finding all overlapping substrings matching a Java regex (although it appears not to have been updated in a while). Specifically, the documentation on non-breaking search specifies:

Using non-breaking search you can finding all possible occureneces of a pattern, including those that are intersecting or nested. This is achieved by using the Matcher's method proceed() instead of find()

asked Nov 13 '22 02:11 by dsg


1 Answer

I faced a similar situation and tried the approaches above, but in my case repeatedly setting the start and end index of the matcher took too much time. I think I've found a better solution, so I'm posting it here for others. My code snippet is below.

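if (textToParse != null) {
    Matcher matcher = PLACEHOLDER_PATTERN.matcher(textToParse);
    // Loop until the most recent find() has hit the end of the input.
    // Each find() resumes where the previous match ended.
    while (!matcher.hitEnd()) {
        boolean result = matcher.find();
        // groupCount() is the number of capturing groups in the pattern.
        int count = matcher.groupCount();
        System.out.println("Result " + result + " count " + count);
        if (result && count == 1) {
            mergeFieldName = matcher.group(1);
            mergeFieldNames.add(mergeFieldName);
        }
    }
}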

I have used the matcher.hitEnd() method to check whether I have reached the end of the text.

Hope this helps. Thanks!

answered Nov 16 '22 03:11 by Darshan