Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding regex string matching using Dynamic Programming

I came across this problem that asks you to implement a regular expression matcher with support for '.' and '*', where

'.' Matches any single character.

'*' Matches zero or more of the preceding element.

isMatch("aa","a") → false
isMatch("aa","aa") → true
isMatch("aaa","aa") → false
isMatch("aa", "a*") → true
isMatch("aa", ".*") → true
isMatch("ab", ".*") → true
isMatch("aab", "c*a*b") → true

While I was able to solve this in a linear fashion, I came across lots of solutions that used DP, like the one below,

class Solution {
    public boolean isMatch(String text, String pattern) {
        boolean[][] dp = new boolean[text.length() + 1][pattern.length() + 1];
        dp[text.length()][pattern.length()] = true;

        for (int i = text.length(); i >= 0; i--){
            for (int j = pattern.length() - 1; j >= 0; j--){
                boolean first_match = (i < text.length() && 
                                       (pattern.charAt(j) == text.charAt(i) ||
                                        pattern.charAt(j) == '.'));
                if (j + 1 < pattern.length() && pattern.charAt(j+1) == '*'){
                    dp[i][j] = dp[i][j+2] || first_match && dp[i+1][j];
                } else {
                    dp[i][j] = first_match && dp[i+1][j+1];
                }
            }
        }
        return dp[0][0];
    }
}

I'm having a hard time understanding this. I've solved a few DP problems that involved grids (shortest path in 2d grid, largest square in binary 2d grid), using a DP table there made perfect sense to me. However here I'm completely lost, I'm unable to understand how traversing a 2d table helps in solving this problem. Further more it appears we know when the characters don't match in the loop, so I don't understand why we don't terminate the search there (this is also probably due to my lack of understanding of how a table traversal leads to a solution). Is there a clear intuitive explanation for problems like these?

like image 568
Rnet Avatar asked Mar 30 '18 09:03

Rnet


People also ask

How does regex matching work?

A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern which it tries to match to the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters.

How do you match expressions in regex?

To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (back-slash).

What does '$' mean in regex?

$ means "Match the end of the string" (the position after the last character in the string).

How is regex used in NLP?

A regular expression (RE) is a language for specifying text search strings. RE helps us to match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are used to search texts in UNIX as well as in MS WORD in identical way.


1 Answers

The intuition to solve problem like this using DP is to figure out answer to following questions

  1. Can the problem be solved using recursion? Which means can it be represented in terms of smaller sub-problem of same type?
  2. Do smaller sub-problems get repeated in the recursion tree? If yes can the result of smaller problem be stored in a manner that whenever similar sub-problem is encountered result can be accessed in O(1). This is usually called memoization.

Lets first understand the problem's solution which you would have figured out while solving in linear fashion.

  1. While matching the text with pattern either first character will match or it will not match.

    Case 1: First character matches or first character of pattern is '.'

    Case 1.1 Next character is '*'

    Case 1.2 Next character is not '*'

    Case 2: First character does not match

    case 2.1 Next character is '*'

    case 2.2 Next character is not '*'

Lets now figure out two steps discussed earlier for above problem.

For case 1.1 examples are

isMatch("a", "a*a") and isMatch("aab", "a*b") which is equivalent to solving

isMatch("a", "a") || isMatch("", "a*a") and

isMatch("aab", "b") || isMatch("ab", "a*b") respectively. First part of || condition covers the scenario of optional character in pattern as it is followed by '*' and second part covers the scenario for repetition of matched character.

Similarly sub-problems can be formed for all the cases. case 2.2 should straightaway return false.

Below is the java code with recursive approach

public boolean isMatch(String text, String pattern) {
    dp = new Boolean[text.length()][pattern.length()];
    return isMatch(text, pattern, 0, 0);
}

private boolean isMatch(String text, String pattern, int ti, int pi) {

    if (pi == pattern.length() && ti < text.length()) return false;

    if (ti == text.length() && pi == pattern.length()) return true;

    if (ti == text.length()) {
        return isNextCharStar(pattern, pi) && isMatch(text, pattern, ti, pi + 2);
    }


    boolean result;
    final boolean hasFirstMatched = text.charAt(ti) == pattern.charAt(pi) || pattern.charAt(pi) == '.';

    if (isNextCharStar(pattern, pi)) {
        result = isMatch(text, pattern, ti, pi + 2);
        if (hasFirstMatched) {
            result = result || isMatch(text, pattern, ti + 1, pi);
        }
        return result;
    }

    return hasFirstMatched && isMatch(text, pattern, ti + 1, pi + 1);

}

private boolean isNextCharStar(String pattern, int pi) {
    return pi < pattern.length() - 1 && pattern.charAt(pi + 1) == '*';
}

Now lets apply the memoization step. If we start saving the result before returning we can reuse it next time. How and where should we save it? For all possible combinations of ti and pi we have to store the result. Where ti is text Index and pi is pattern index. So a 2d array of size text.length * pattern.length should be sufficient. Following is the code after above changes

Boolean [][] dp;

public boolean isMatch(String text, String pattern) {
    dp = new Boolean[text.length()][pattern.length()];
    return isMatch(text, pattern, 0, 0);
}

private boolean isMatch(String text, String pattern, int ti, int pi) {

    if (pi == pattern.length() ) return ti == text.length();

    if (ti == text.length()) {
        return isNextCharStar(pattern, pi) && isMatch(text, pattern, ti, pi + 2);
    }

    if(dp[ti][pi] != null) return dp[ti][pi];

    boolean result;
    final boolean hasFirstMatched = text.charAt(ti) == pattern.charAt(pi) || pattern.charAt(pi) == '.';

    if (isNextCharStar(pattern, pi)) {
        result = isMatch(text, pattern, ti, pi + 2);
        if (hasFirstMatched) {
            result = result || isMatch(text, pattern, ti + 1, pi);
        }
        dp[ti][pi] = result;
        return result;
    }

    dp[ti][pi] = hasFirstMatched && isMatch(text, pattern, ti + 1, pi + 1);
    return dp[ti][pi];

}

private boolean isNextCharStar(String pattern, int pi) {
    return pi < pattern.length() - 1 && pattern.charAt(pi + 1) == '*';
}

If you will look closely only 3 lines have been changed to make it a DP solution from a recursive solution.

like image 95
Abhishek Jha Avatar answered Oct 17 '22 16:10

Abhishek Jha