This question is a follow-up for the following post: Javascript regex: Find all URLs outside <a> tags - Nested Tags
I discovered that the code:
\b((https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
is extremely inefficient compared to executing it separately for http
and ftp
part like this:
\b(https?:\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
and
\b(ftps?:\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
Here are examples at regex101.com:
However, in one of my HTML page these codes compares as 85628 steps vs. 7258 + 795 steps, that is quite insane.
As far as I have seen, using (x|y) pattern reduces the execution length but here probably for a strange reason it is otherwise.
Any help would be appreciated.
It seems that you are a victim of catastrophic backtracking.
This regex does the trick in just 3492 steps:
\b(?>(https?|ftps?):\/\/[^"<\s]+)(?![^<>]*>|[^"]*?<\/a)
All I have done is made the first group an atomic group, causing the engine to discard all backtracking options once it's matched it.
That's correct in your case: you can think of it now as two parts, "find a URL" then "Use the negative lookahead to decide if we want to keep it". Your original regex would, when the lookahead failed, proceed to backtrack into the url-matching expression. The [^"<\s]+
block would yield some symbols, then it would try the lookahead again, then yield some more symbols, and try again, and so on...
The reason the addition of the https?|ftps?
part made it so much worse was that this provides an extra source of backtracking (losing the optional s) in a way that allows all the later backtracking to happen all over again.
You know that regex101.com has a "regex debugger" option on the toolbar on the left? If you use that, it explains how your regex matches, so you can (as I just did) figure out where the crazy backtracking is.
Bonus edit: A further improved one that only takes 3185 steps:
\b(?>ht|f)tps?:\/\/(?>[^"<\s]+)(?![^<>]+>|[^"]*?<\/a)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With