Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

regular expression goes into infinite loop

Tags:

java

regex

I am parsing (species) names of the form:

Parus Ater
H. sapiens
T. rex
Tyr. rex

which normally have two terms (binomial) but sometimes have 3 or more.

Troglodytes troglodytes troglodytes 
E. rubecula sensu stricto

I wrote

[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*

which worked most of the time but occasionally went into an infinite loop. It took some time to track down that it was in the regex matching and then I realised it was a typo and I should have written

[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s+[a-z]+)*

which performs properly.

My questions are:

  • why does this loop happen?
  • is there a way I can check for similar regex errors before running the program? Otherwise it may be difficult to trap them before the prgram is distributed and cause problems.

[Note: I don't need a more general expression for species - there is a formal 100+ line regex specification for Species names - this was just an initial filter].

NOTE: The problem arose because although most names were extracted precisely into 2 or occasionally 3/4 terms (as they were in italics) there were a few false positives (like "Homo sapiens lives in big cities like London") and the match fails at "L".]

NOTE: In debugging this I have found that the regex was often completing but being very slow (e.g. on shorter target strings). It is valuable that I found this bug through a pathological case. I have learnt an important lesson!

like image 265
peter.murray.rust Avatar asked Apr 10 '13 07:04

peter.murray.rust


People also ask

Can a regex be infinite?

Thus there is no regular expression which can define same language as the language defined by union of infinite regular expressions. Thus regular expressions can have only finite expressions.

How do you know if a loop is infinite?

An infinite loop is a sequence of instructions in a computer program which loops endlessly, either due to the loop having no terminating condition, having one that can never be met, or one that causes the loop to start over.

How do you repeat a regular expression?

An expression followed by '*' can be repeated any number of times, including zero. An expression followed by '+' can be repeated any number of times, but at least once. An expression followed by '? ' may be repeated zero or one times only.

Are infinite while loops possible?

A common infinite loop occurs when the condition of the while statement is set to true . Below is an example of code that will run forever. It is not necessary to test any infinite loops. An infinite loop will run forever, but the program can be terminated with the break keyword.


1 Answers

To address the first part of your question, you should read up on catastrophic backtracking. Essentially, what is happening is there are too many ways to match your regular expression with your string, and the parser is continually back tracking to try and make it work.

In your case, it was probably the nested repitition: (\s*[a-z]+)* Which likely caused some very very strange loops. As Qtax has adeptly pointed out, it's hard to tell without more information.

The second part of your question is, unfortunately, impossible to answer. It's basically the Halting problem. Since Regular Expressions are essentially of a finite state machine whose input is a string, you cannot create a general solution which predicts which regular expressions will backtrack catastrophically, and which will not.

As far as some tips for making your regular expressions run faster? That's a big can of worms. I've spent a lot of time studying regular expressions on my own, and some time optimizing them, and here's what I've found generally helps:

  1. Compile your regular expressions outside of your loops, if your language supports it.
  2. Whenever possible, add anchors when you know they're useful. Especially the ^ for the beginning of the string. See also: Word Boundaries
  3. Avoid nested repetition like the plague. If you have to have it (which you will), do your best to provide hints to the engine to short circuit any unintended backtracking.
  4. Take advantage of flavor constructs to speed things up. I'm partial to Non-Capturing groups and possessive quantifiers. They don't appear in every flavor, but when they do, you should use them. Also check out Atomic Groups
  5. I always find this to be true: The longer your regular expression gets, The more trouble you're going to have making it efficient. Regular expressions are a great and powerful tool, they're like a super smart hammer. Don't fall into the trap of seeing everything as a nail. Sometimes the string function you're looking for is right under your nose.

Hope this helps you. Good luck.

like image 63
FrankieTheKneeMan Avatar answered Sep 28 '22 08:09

FrankieTheKneeMan