Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Javascript regex: Find all URLs optimization

This question is a follow-up for the following post: Javascript regex: Find all URLs outside <a> tags - Nested Tags

I discovered that the code:

\b((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

is extremely inefficient compared to executing it separately for http and ftp part like this:

\b(https?:\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

and

\b(ftps?:\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

Here are examples at regex101.com:

  • 1st method - 6395 steps
  • 2nd method - 3393 steps + 863 steps

However, in one of my HTML page these codes compares as 85628 steps vs. 7258 + 795 steps, that is quite insane.

As far as I have seen, using (x|y) pattern reduces the execution length but here probably for a strange reason it is otherwise.

Any help would be appreciated.

like image 468
Klaidonis Avatar asked Mar 12 '16 09:03

Klaidonis


1 Answers

It seems that you are a victim of catastrophic backtracking.

This regex does the trick in just 3492 steps:

\b(?>(https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)

All I have done is made the first group an atomic group, causing the engine to discard all backtracking options once it's matched it.

That's correct in your case: you can think of it now as two parts, "find a URL" then "Use the negative lookahead to decide if we want to keep it". Your original regex would, when the lookahead failed, proceed to backtrack into the url-matching expression. The [^"<\s]+ block would yield some symbols, then it would try the lookahead again, then yield some more symbols, and try again, and so on...

The reason the addition of the https?|ftps? part made it so much worse was that this provides an extra source of backtracking (losing the optional s) in a way that allows all the later backtracking to happen all over again.

You know that regex101.com has a "regex debugger" option on the toolbar on the left? If you use that, it explains how your regex matches, so you can (as I just did) figure out where the crazy backtracking is.

Bonus edit: A further improved one that only takes 3185 steps:

\b(?>ht|f)tps?:\/\/(?>[^"<\s]+)(?![^<>]+>|[^"]*?<\/a)
like image 104
Chris Kitching Avatar answered Nov 01 '22 12:11

Chris Kitching