This article says that regexp matching in Java is slow because regexps with "back references" cannot be matched efficiently. The article explains efficient Thomson's NFA-based matching algorithm (invented in 1968) which works for regexps without "back references". However the Pattern
javadoc says Java regexps use NFA-based approach.
Now I wonder how efficient Java regexp matching is and what algorithm it uses.
A regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. “find and replace”-like operations.
Most library implementations of regular expressions use a backtracking algorithm that can take an exponential amount of time on some inputs.
java.util.regex.Pattern
uses Boyer–Moore string search algorithm
/* Attempts to match a slice in the input using the Boyer-Moore string
* matching algorithm. The algorithm is based on the idea that the
* pattern can be shifted farther ahead in the search text if it is
* matched right to left.
*/
private void compile() {
----------------------
-----------------------
if (matchRoot instanceof Slice) {
root = BnM.optimize(matchRoot);
if (root == matchRoot) {
root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
}
} else if (matchRoot instanceof Begin || matchRoot instanceof First) {
root = matchRoot;
} else {
root = hasSupplementary ? new StartS(matchRoot) : new Start(matchRoot);
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With