Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does this regex take so long to find email addresses in certain files?

I have a regular expression that looks for email addresses ( this was taken from another SO post that I can't find and has been tested on all kinds of email configurations ... changing this is not exactly my question ... but understand if that is the root cause ):

/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

I'm using preg_match_all() in PHP.

This works great for 99.99...% of files I'm looking in and takes around 5ms, but occasionally takes a couple minutes. These files are larger than the average webpage at around 300k, but much larger files generally process fine. The only thing I can find in the file contents that stands out is strings of thousands of consecutive "random" alphanumeric characters like this:

wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...

Here are two pages causing the problem. View source to see the long strings.

  • http://www.ashrae.org/members/page/607
  • http://www.ashrae.org/publications/page/2010ajindex

Any thoughts on what is causing this?

--FINAL SOLUTION--

I tested various regexes suggested in the answers. @FailedDev's answer helped and dropped processing time from a few minutes to a few seconds. @hakre's answer solved the problem and reduced processing time to a few hundred milliseconds. Below is the final regex I used. It's @hakre's second suggestion.

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i
like image 925
T. Brian Jones Avatar asked Nov 17 '11 23:11

T. Brian Jones


2 Answers

You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?

For example, you're using + to match one or more chars. Let's say you have a string of 10 000 chars. The regex must look 10 000 combinations to find the largest match. Then you combine it with similar ones. Let's say you have a string with 20 000 chars and two + groups. How could they match in the file. Probably 10 000 x 10 000 possibilities. And so on and so forth.

If you can limit the number of characters (this looks a bit like you're looking for email patterns), probably limit the email address domain name to 256 and the address itself to 256 characters. Then this would be 256 x 256 possibilities to test "only":

/[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i

That's probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

Which should speed it up again.

like image 184
hakre Avatar answered Sep 20 '22 12:09

hakre


My best guess would be to try using possesive quantifiers :

[a-z0-9_\-\+]+

to

[a-z0-9_\-\+]++

This should fail the regex faster so it may improve performance in these situations.

Edit:

Maybe atomic grouping could also help :

/(?>[a-z0-9_\-+]++)@(?>[a-z0-9\-]++\.)(?>[a-z]{2,3})(?:\.[a-z]{2})?/

You should first go with option one. It would be interesting to see if there is any difference by also using option two.

like image 41
FailedDev Avatar answered Sep 19 '22 12:09

FailedDev