Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does this regular expression kill the Java regex engine?

Tags:

java

regex

I have this naive regex "<([\s]|[^<])+?>" (excluding the quotation marks). It seems so straightforward but it is indeed evil when it works against the below HTML text. It sends the Java regular expression engine to an infinite loop.

I have another regex ("<.+?>"), which does somewhat the same thing, but it doesn't kill anything. Do you know why this happens?

<script language="JavaScript" type="text/javascript">
        var numDivs, layerName;
        layerName = "lnavLayer";
        catLinkName = "category";
        numDivs = 2;
        function toggleLayer(layerID){
            if (!(navigator.appName == "Netscape" && navigator.appVersion.substr(0, 1) < 5)){
                thisLayer = document.getElementById(layerName + layerID);
                categoryLink = document.getElementById(catLinkName + layerID);
                closeThem();
                if (thisLayer.className == 'subnavDefault'){
                    thisLayer.className = 'subnavToggled';
                    categoryLink.className = 'leftnavLinkSelectedSection';
                }
            }
        }
        function closeThem(){
            for(x = 0; x < numDivs; x++){
                theLayer = document.getElementById(layerName + (x
+ 1));
                thecategoryLink = document.getElementById(catLinkName + (x + 1));
                theLayer.className = 'subnavDefault';
                thecategoryLink.className = 'leftnavLink';
            }
        } var flag = 0; var lastClicked = 0
    //-->
    </script>

it even keeps looping with an online Java regex tool (such as www.fileformat.info/tool/regex.htm) or a utility like RegexBuddy.

like image 639
Martin08 Avatar asked Nov 13 '08 23:11

Martin08


People also ask

Which regex engine does Java use?

Java Regex API provides 1 interface and 3 classes in java. util. regex package.

What is Java regex used for?

Regular expressions can be used to perform all types of text search and text replace operations. Java does not have a built-in Regular Expression class, but we can import the java. util. regex package to work with regular expressions.

Is regex faster than for loop Java?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.

Should I avoid regex?

There is no reason in the world to avoid throwing a regex at a string like "<a href='foo'>stuff</a>" . Modern regexes have no trouble with this.


1 Answers

The reason the Java regex engine crashes is that this part of your regex causes a stack overflow (indeed!):

[\s]|[^<]

What happens here is that every character matched by \s can also be matched by [^<]. That means there are two ways to match each whitespace character. If we represent the two character classes with A and B:

A|B

Then a string of three spaces could be matched as AAA, AAB, ABA, ABB, BAA, BAB, BBA, or BBB. In other words the complexity of this part of the regex is 2^N. This will kill any regex engine that doesn't have any safeguards against what I call catastrophic backtracking.

When using alternation (vertical bar) in a regex, always make sure the alternatives are mutually exclusive. That is, at most one of the alternatives may be allowed to match any given bit of text.

like image 92
Jan Goyvaerts Avatar answered Oct 02 '22 13:10

Jan Goyvaerts