I am parsing (species) names of the form:
Parus Ater
H. sapiens
T. rex
Tyr. rex
which normally have two terms (binomial) but sometimes have 3 or more.
Troglodytes troglodytes troglodytes
E. rubecula sensu stricto
I wrote
[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*
which worked most of the time but occasionally went into an infinite loop. It took some time to track down that the problem was in the regex matching, and then I realised it was a typo: I should have written
[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s+[a-z]+)*
which performs properly.
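For anyone who wants to reproduce this, here is a minimal Python sketch of the two patterns side by side. It assumes whole-string matching via `re.fullmatch` (an assumption, since the matching call itself is not shown above), and the names `BAD`, `GOOD`, and `is_name` are made up for illustration. The failing input is deliberately short so that the pattern with the typo still returns quickly.

```python
import re

# Pattern with the typo: \s* lets the group repeat with no separating whitespace.
BAD = re.compile(r"[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*")
# Corrected pattern: every extra term must be preceded by whitespace.
GOOD = re.compile(r"[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s+[a-z]+)*")


def is_name(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


names = [
    "H. sapiens",
    "T. rex",
    "Tyr. rex",
    "Troglodytes troglodytes troglodytes",
    "E. rubecula sensu stricto",
]

# Both patterns accept the well-formed names.
for name in names:
    print(name, is_name(BAD, name), is_name(GOOD, name))

# Both reject a false positive, but BAD tries many more splits before giving up.
short_false_positive = "Homo sapiens lives in London"
print(is_name(BAD, short_false_positive), is_name(GOOD, short_false_positive))
```

Lengthening the lowercase part of the false positive makes the first pattern noticeably slower, while the second stays fast.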
My questions are:
[Note: I don't need a more general expression for species - there is a formal 100+ line regex specification for Species names - this was just an initial filter].
NOTE: The problem arose because, although most names were extracted precisely into 2 or occasionally 3/4 terms (as they were in italics), there were a few false positives (like "Homo sapiens lives in big cities like London"), and the match fails at "L".
NOTE: In debugging this I have found that the regex was often completing but being very slow (e.g. on shorter target strings). It is valuable that I found this bug through a pathological case. I have learnt an important lesson!
An expression followed by '*' can be repeated any number of times, including zero. An expression followed by '+' can be repeated any number of times, but at least once. An expression followed by '?' may be repeated zero or one times only.
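A small Python sketch of those three quantifiers, including the `\s*` versus `\s+` difference that matters for the pattern in the question. The helper name `matches` is made up for illustration.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# '*' allows zero or more repetitions.
print(matches(r"ab*", "a"), matches(r"ab*", "abbb"))
# '+' needs at least one repetition.
print(matches(r"ab+", "a"), matches(r"ab+", "abbb"))
# '?' allows zero or one repetition.
print(matches(r"ab?", "ab"), matches(r"ab?", "abb"))

# \s* accepts "no whitespace at all", while \s+ insists on at least one space.
print(matches(r"a\s*b", "ab"), matches(r"a\s+b", "ab"))
```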
To address the first part of your question, you should read up on catastrophic backtracking. Essentially, there are too many ways to match your regular expression against your string, and the regex engine keeps backtracking to try each of them before it can give up.
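In your case, it was probably the nested repetition: (\s*[a-z]+)*
Because \s* can match nothing, the same run of letters can be split between iterations of the group in many different ways, which likely caused the very long loops you saw. As Qtax has adeptly pointed out, it's hard to tell without more information.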
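To get a feel for "too many ways to match", here is a small Python sketch that counts how many ways a run of n lowercase letters can be cut into consecutive non-empty chunks, which is what `(\s*[a-z]+)*` permits whenever `\s*` matches nothing. The function name `count_splits` is made up for illustration, and the count is a model of the search space, not a measurement of any particular regex engine.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Ways to cut n letters into an ordered sequence of non-empty chunks."""
    if n == 0:
        return 1  # nothing left to cut: exactly one way (stop here)
    # The first chunk takes k letters; the remainder is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for word in ["in", "big", "like", "lives", "cities"]:
    print(word, count_splits(len(word)))

# With \s+ instead of \s*, a chunk can only start after whitespace,
# so each word can be consumed in exactly one way.
```

The count doubles with every extra letter (it is 2^(n-1)), and the counts for separate words multiply, which is why a longer sentence can look like an infinite loop.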
The second part of your question is, unfortunately, very hard to answer in general. It is reminiscent of the Halting problem: even though a regular expression is essentially a finite state machine whose input is a string, there is no simple general rule that predicts which regular expressions will backtrack catastrophically and which will not.
As for tips on making your regular expressions run faster, that's a big can of worms. I've spent a lot of time studying regular expressions on my own, and some time optimizing them, and here's what I've found generally helps:
Use anchors where you can, such as ^ for the beginning of the string. See also: Word Boundaries
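As a rough illustration of the anchoring tip, here is a Python sketch using the corrected pattern from the question. The variable names are made up for illustration, and `re.search` is assumed as the matching call.

```python
import re

GOOD = r"[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s+[a-z]+)*"

# Unanchored: the engine may start a match at any position in the string.
unanchored = re.search(GOOD, "see E. rubecula sensu stricto here")
print(unanchored.group(0) if unanchored else None)

# Anchored with ^: only position 0 is tried, so this fails immediately.
anchored = re.search(r"^" + GOOD, "see E. rubecula sensu stricto here")
print(anchored)

# \b demands a word boundary, so a capital letter in mid-word cannot start a name.
print(re.search(GOOD, "xHomo sapiens") is not None)
print(re.search(r"\b" + GOOD, "xHomo sapiens") is not None)
```

Anchors do not fix nested repetition by themselves, but they cut down the number of starting positions the engine has to try.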
Hope this helps you. Good luck.