Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex negative lookbehind and lookahead: equivalence and performance

I need a regex that will select only those URL strings NOT ending with specific extensions like .png or .css.

I tested the following:

1) this one using negative lookbehind:

(?<!\.png|\.css)$

https://regex101.com/r/tW4fO5/1

2) this other one using negative lookahead:

^(?!.*[.]png|.*[.]css$).*$

https://regex101.com/r/qZ7vA4/1

Both seems to work fine, but #1 (negative lookbehind) is said to be processed in 436 steps (see the link), while #2 (negative lookahead) is said to be processed in 173 steps.

So my question is: what does that mean? Is it going to have an impact on performances?

And lastly, are the two regex really functionally equivalent?

EDIT: SOLUTION SUMMARY

Just to wrap things up, considering the full list of string-endings to be excluded via the regex (a typical scenario would be a web server setup where static resources are served by apache while dynamic stuff is served by a different engine - in my case: php-fpm).

Two options are possible with PCRE regex:

1) negative lookbehind

$(?<!\.(?:ico|gif|jpg|png|css|rss|xml|htm|pdf|zip|txt|ttf)$|(?:js|gz)$|(?:html|woff)$)

https://regex101.com/r/eU9fI6/1

Notice that I used several OR-ed lookbehinds because the negative lookbehind requires a fixed-width pattern (ie: you cannot mix patterns of different lengths). This makes this options sligthly more complex to write. Moreover this lowers its performance in my opinion.

2) negative lookahead

^(?!.*[.](?:js|ico|gif|jpg|png|css|rss|xml|htm|html|pdf|zip|gz|txt|ttf|woff)$).*$

https://regex101.com/r/dP7uD9/1

The lookahead is slightly faster than the lookbehind. This is a test result from making 1 million iterations:

time lookbehind = 18.469825983047 secs
time lookahead = 14.316685199738 secs

If I had not the issue of the variable lenght patterns, I would pick the lookbehind since it looks more compact. Either one is good anyway. At the end, I went with the lookahead:

<LocationMatch "^(?!.*[.](?:js|ico|gif|jpg|png|css|rss|xml|htm|html|pdf|zip|gz|txt|ttf|woff)$).*$">
    SetHandler "proxy:unix:/var/run/php5-fpm.sock|fcgi://www/srv/www/gioplet/web/public/index.php"
</LocationMatch>
like image 696
Timido Avatar asked Feb 18 '16 08:02

Timido


2 Answers

Is it going to have an impact on performances?

In most cases, the more steps a regex needs to find a match, the slower the performance is. Although it also depends what platform you will use the regex in later (say, if you test a regex for use in .NET using regex101.com, it does not mean it will cause a catastrophic backtracking with a lazy dot matching regex failing with a long text).

Are the two regex really functionally equivalent?

No, they aren't. (?<!\.png|\.css)$ finds an end of the line that is not preceded with .png or .css. ^(?!.*[.]png|.*[.]css$).*$ finds lines that do not contain .png or the lines that do not end with .css. To make them "equivalent" (that is, if you want to make sure the lines ending with .png or .css are not matched), use

^(?!.*[.](?:png|css)$).*$
         ^^^^^^^^^^^^

Make sure the $ is checked after both png and css in the negative lookahead.

There will still be the difference between the regexps: the first will just match the end of the line, and the second will match the whole line.

Is there a way to speed up the lookbehind solution?

Note that the lookbehind in Pattern 1 is checked at each location inside the string. The lookahead in Pattern 2 is only checked once, at the very beginning of the string. That is why an anchored lookahead solution will be faster UNDER one condition - if you cannot use a RightToLeft modifier that is only available in few regex flavors (e.g. .NET).

The $(?<!\.(?:png|css)$) lookbehind solution is faster than Pattern 1 because the lookbehind pattern is checked just once, after reaching the end of string/line. Still, this takes a bit more steps because of the implementation of a lookbehind that is costlier than a lookahead.

To really find out which solution is fastest, you need to set up performance tests in your environment.

like image 74
Wiktor Stribiżew Avatar answered Nov 15 '22 10:11

Wiktor Stribiżew


The second one or the lookahead one is faster. Remember number of steps is not the correct way. See the Stackoverflow question: atomic-groups-clarity.

I have tested on python using timeit. The script is

import timeit
s1="""
import re
re.findall(r"^.*(?<!\.png|\.css)$",x,re.M)"""

s2="""
import re
re.findall(r"^(?!.*[.]png$|.*[.]css$).*$",x,re.M)"""

print timeit.timeit(s1,number=1000000,setup='x="""http://gioplet/articles\nhttp://gioplet/img/logo.png\nhttp://gioplet/index.php\nhttp://gioplet/css/main.css"""')

print timeit.timeit(s2,number=1000000,setup='x="""http://gioplet/articles\nhttp://gioplet/img/logo.png\nhttp://gioplet/index.php\nhttp://gioplet/css/main.css"""')

Output:

8.72536265265
7.09159428305
like image 22
vks Avatar answered Nov 15 '22 08:11

vks