Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression matching algorithm in Java

This article says that regexp matching in Java is slow because regexps with "back references" cannot be matched efficiently. The article explains efficient Thomson's NFA-based matching algorithm (invented in 1968) which works for regexps without "back references". However the Pattern javadoc says Java regexps use NFA-based approach.

Now I wonder how efficient Java regexp matching is and what algorithm it uses.

like image 786
Michael Avatar asked Oct 08 '13 15:10

Michael


People also ask

What is regular expression matching?

A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.

What algorithm is used with regex?

Most library implementations of regular expressions use a backtracking algorithm that can take an exponential amount of time on some inputs.


1 Answers

java.util.regex.Pattern uses Boyer–Moore string search algorithm

/* Attempts to match a slice in the input using the Boyer-Moore string
 * matching algorithm. The algorithm is based on the idea that the
 * pattern can be shifted farther ahead in the search text if it is
 * matched right to left.
 */

private void compile() {
    ----------------------
    -----------------------

   if (matchRoot instanceof Slice) {
        root = BnM.optimize(matchRoot);
        if (root == matchRoot) {
            root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
        }
    } else if (matchRoot instanceof Begin || matchRoot instanceof First) {
        root = matchRoot;
    } else {
        root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
    }
}
like image 88
Prabhakaran Ramaswamy Avatar answered Nov 10 '22 00:11

Prabhakaran Ramaswamy