Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex Pattern Catastrophic backtracking

I have the regex shown below used in one of my old Java systems which is causing backtracking issues lately. Quite often the backtracking threads cause the CPU of the machine to hit the upper limit and it does not return back until the application is restarted.

Could any one suggest a better way to rewrite this pattern or a tool which would help me to do so?

Pattern:

^\[(([\p{N}]*\]\,\[[\p{N}]*)*|[\p{N}]*)\]$

Values working:

[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]

Catastrophic backtracking values — if anything is wrong in the values (an extra brace added at the end):

[1234567],[89023432],[124534543],[4564362],[1234543],[12234567],[124567],[1234567],[1234567]]
like image 663
Achilles Avatar asked Apr 20 '15 14:04

Achilles


People also ask

What is catastrophic backtracking in regular expressions?

What is the meaning of catastrophic backtracking in regular expressions? Interview Response: Catastrophic backtracking is a condition that can occur if you are checking a (usually long) string against a complex regular expression. The problem usually occurs if something towards the end of the string causes the string not to match.

What causes backtracking in regex?

One of the biggest causes of runaway backtracking is to have two or more alternatives that can match the same things. That's what you've got with the | [\p {N}]* part. The regex engine has to try every conceivable path through the string before it gives up, so all those \p {N}* constructs get into an endless tug-of-war over every group of digits.

Can We forbid backtracking for the quantifier in regexp?

We can forbid backtracking for the quantifier. The root of the problem is that the regexp engine tries many combinations that are obviously wrong for a human. E.g. in the regexp (\d+)*$ it’s obvious for a human, that + shouldn’t backtrack.

What is wrong with the regexp engine?

The root of the problem is that the regexp engine tries many combinations that are obviously wrong for a human. E.g. in the regexp (\d+)*$ it’s obvious for a human, that + shouldn’t backtrack. If we replace one \d+ with two separate \d+\d+, nothing changes: \d+........ (123456789)! \d+...\d+.... (1234) (56789)!


1 Answers

Never use * when + is what you mean. The first thing I noticed about your regex is that almost everything is optional. Only the opening and closing square brackets are required, and I'm pretty sure you don't want to treat [] as a valid input.

One of the biggest causes of runaway backtracking is to have two or more alternatives that can match the same things. That's what you've got with the |[\p{N}]* part. The regex engine has to try every conceivable path through the string before it gives up, so all those \p{N}* constructs get into an endless tug-of-war over every group of digits.

But there's no point trying to fix those problems, because the overall structure is wrong. I think this is what you're looking for:

^\[\p{N}+\](?:,\[\p{N}+\])*$

After it consumes the first token ([1234567]), if the next thing in the string is not a comma or the end of the string, it fails immediately. If it does see a comma, it must go on to match another complete token ([89023432]), or it fails immediately.

That's probably the most important thing to remember when you're creating a regex: if it's going to fail, you want it to fail as quickly as possible. You can use features like atomic groups and possessive quantifiers toward that end, but if you get the structure of the regex right, you rarely need them. Backtracking is not inevitable.

like image 197
Alan Moore Avatar answered Oct 28 '22 20:10

Alan Moore