Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract all occurrences of pattern K and check if string matches "K*" in 1 pass

Tags:

java

regex

For a given input string and a given pattern K, I want to extract every occurrence of K (or some part of it (using groups)) from the string and check that the entire string matches K* (as in it consists of 0 or more K's with no other characters).

But I would like to do this in a single pass using regular expressions. More specifically, I'm currently finding the pattern using Matcher.find, but this is not strictly required.

How would I do this?

I already found a solution (and posted an answer), but would like to know if there is specific regex or Matcher functionality that addresses / can address this issue, or simply if there are better / different ways of doing it. But, even if not, I still think it's an interesting question.

Example:

Pattern: <[0-9]> (a single digit in <>)

Valid input: <1><2><3>

Invalid inputs:

<1><2>a<3>
<1><2>3
Oh look, a flying monkey!
<1><2><3

Code to do it in 2 passes with matches:

boolean products(String products)
{
    String regex = "(<[0-9]>)";
    Pattern pAll = Pattern.compile(regex + "*");

    if (!pAll.matcher(products).matches())
        return false;

    Pattern p = Pattern.compile(regex);
    Matcher matcher = p.matcher(products);

    while (matcher.find())
        System.out.println(matcher.group());

    return true;
}
like image 512
Bernhard Barker Avatar asked May 16 '13 11:05

Bernhard Barker


2 Answers

1. Defining the problem

Since it is not clear what to output when the whole string does not match pattern K*, I will redefine the problem to make it clear what to output in such case.

Given any pattern K:

  • Check that the string has the pattern K*.
  • If the string has pattern K*, then split the string into non-overlapping tokens that matches K.
  • If the string only has prefix that matches pattern K*, then pick the prefix that is chosen by K*+1, and split the prefix into tokens that matches K.

1 I don't know if there is anyway to get the longest prefix that matches K. Of course, you can always remove the last character one by one and test against K* until it matches, but it is obviously inefficient.

Unless specify otherwise, whatever I write below will follow my problem description above. Note that the 3rd bullet point of the problem is to resolve the ambiguity on which prefix string to take.

2. Repeated capturing group in .NET

The problem above can be solved if we have the solution to the problem:

Given a pattern (K)*, which is a repeated capturing group, get the captured text for all the repetitions, instead of only the last repetition.

  • In the case where the string has pattern K*, by matching against ^(K)*$, we can get all tokens that match pattern K.
  • In the case where the string only has prefix that matches K*, by matching against ^(K)*, we can get all tokens that match pattern K.

This is the case in .NET regex, since it keeps all the captured text for a repeated capturing group.

However, since we are using Java, we don't have access to such feature.

3. Solution in Java

Checking that the string has the pattern K* can always be done with Matcher.matches()/String.matches(), since the engine will do full-blown backtracking on the input string to somehow "unify" K* with the input string. The hard thing is to split the input string into tokens that matches pattern K.

If K* is equivalent to K*+

If the pattern K has the property:

For all strings2, K* is equivalent to K*+, i.e. how the input string is split up into tokens that match pattern K is the same.

2 You can define this condition for only the input strings you are operating on, but ensuring this pre-condition is not easy. When you define it for all strings, you only need to analyze your regex to check whether the condition holds or not.

Then a one-pass solution that solves the problem can be constructed. You can repeatedly use Matcher.find() on the pattern \GK, and checks that the last match found is right at the end of the string. This is similar to your current solution, except that you do the boundary check with code.

The + after the quantifier * in K*+ makes the quantifier possessive. Possessive quantifier will prevent the engine from backtracking, which means each repetition is always the first possible match for the pattern K. We need this property so that the solution \GK has equivalent meaning, since it will also return the first possible match for the pattern K.

If K* is NOT equivalent to K*+

Without the property above, we need 2 passes to solve the problem. First pass to call Matcher.matches()/String.matches() on the pattern K*. On second pass:

  • If the string does not match pattern K*, we will repeatedly use Matcher.find() on the pattern \GK until no more match can be found. This can be done due to how we define which prefix string to take when the input string does not match pattern K*.

  • If the string matches pattern K*, repeatedly use Matcher.find() on the pattern \GK(?=K*$) is one solution. This will result in redundant work matching the rest of the input string, though.

Note that this solution is universally applicable for any K. In other words, it also applies for the case where K* is equivalent to K*+ (but we will use the better one-pass solution for that case instead).

like image 152
nhahtdh Avatar answered Sep 29 '22 21:09

nhahtdh


Here is an additional answer to the already accepted one. Below is an example code snippet that only goes through the pattern once with m.find(), which is similar to your one pass solution, but will not parse non-matching lines.

import java.util.regex.*;

class test{
    public static void main(String args[]){
        String t = "<1><2><3>";
        Pattern pat = Pattern.compile("(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*)");
        Matcher m = pat.matcher(t);
        while (m.find()) {
            System.out.println("Matches!");
            System.out.println(m.group());
        }       

    }
}

The regex explained:

<\\d> --This is your k pattern as defined above
?= -- positive lookahead (check what is ahead of K)
<\\d>* -- Match k 0 or more times
$ -- End of line
?<= -- positive lookbehind (check what is behind K)
^ -- beginning of line
<\\d>* -- followed by 0 or more Ks

Regular expressions are beautiful things.

Edit: As pointed out to me by @nhahtdh, this is just an implemented version of the answer. In fact the implementation above can be improved with the knowledge in the answer.
(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*) can be changed to \\G<\\d>(?=(<\\d>)*$).

like image 34
Scott Avatar answered Sep 29 '22 21:09

Scott