"Regular Expression is too large" error in PHP

Tags: regex, php

I am working on a relatively complex and very large regular expression. It is currently 41,127 characters long and may grow somewhat as additional cases are added. I am starting to get this error in PHP:

preg_match_all(): Compilation failed: regular expression is too large at offset 41123

Is there a way to increase the size limit? The following settings, suggested elsewhere, did NOT work, because they apply to matching against the data and NOT to the size of the regex itself:

ini_set("pcre.backtrack_limit", "100000000");
ini_set("pcre.recursion_limit", "100000000");

Alternatively, is there a way to define a "sub-pattern variable" within the regex that could be reused at various places within the regex? (I am not talking about repetition using * or +, or even repeating a matched group with a backreference such as \1.) I am currently using PHP variables containing sub-patterns that are repeated in a few places within the regex, but this expands the regex BEFORE it is passed on to the PCRE functions.

This is a complex regular expression, and cannot be replaced by simpler keyword-searching using strpos or similar as suggested at this link.

I would prefer to avoid splitting this into sub-expressions at | and matching the sub-expressions separately, because the reduction in size would be modest (there are only 2 or 3 top-level | alternations), and this would complicate further development.

mkulon asked Jul 01 '15 22:07


3 Answers

Depending on the application, valid solutions are:

  • Shorten the Regular Expression by using DEFINE for any redundant sub-expressions (see below).
  • Increase the maximum regex size by recompiling PHP (see drew010's great answer), although this may not be possible in all environments and may create compatibility issues when changing servers.
  • Split your regular expression at | and process the resulting sub-expressions separately. If the regex is essentially numerous keywords separated by |, then converting to strtok or a loop with strpos may be a better and faster choice.
  • Use another language or regex engine, such as C++ with Boost, although I did not verify this.
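For the keyword-list case mentioned in the third bullet, a loop with strpos could look roughly like the sketch below. The function name findKeywords and the sample data are hypothetical, and this only applies when the pattern really is a list of literal keywords.

```php
<?php
// Minimal sketch: replace a huge "kw1|kw2|kw3|..." regex with literal searches.
// findKeywords() is a hypothetical helper, not part of PHP.
function findKeywords(array $keywords, string $subject): array
{
    $found = [];
    foreach ($keywords as $keyword) {
        // strpos() returns false when the keyword is absent; 0 is a valid position.
        if (strpos($subject, $keyword) !== false) {
            $found[] = $keyword;
        }
    }
    return $found;
}

$hits = findKeywords(['apple', 'banana', 'cherry'], 'a banana and a cherry');
```

Note that this matches substrings without word boundaries, so it is not a drop-in replacement for a regex that relies on \b or other context.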

Solution to my specific problem: As per Mario's comment, using the (?(DEFINE)...) construct for some of the sub-expressions that were re-used several times reduced my regex size from 41,127 characters down to "only" 4,071, and this was an elegant solution to get rid of the error “Regular Expression is too large.”

See: (?(DEFINE)...) syntax reference at rexegg.com
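As a minimal sketch of the idea (the sub-pattern names octet and ipv4, and the sample text, are made up for illustration and are not from my actual regex), each sub-expression is written once inside (?(DEFINE)...) and then referenced with (?&name) wherever it is needed:

```php
<?php
// The x flag lets the pattern be spread over several lines; whitespace is ignored.
$pattern = '/
    (?(DEFINE)
        (?<octet> 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d )
        (?<ipv4>  (?&octet) (?: \. (?&octet) ){3} )
    )
    \b from \s+ (?<src>(?&ipv4)) \s+ to \s+ (?<dst>(?&ipv4)) \b
/x';

$subject = 'route from 10.0.0.1 to 192.168.1.254 ok';
$count = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

// Each match exposes the named groups that actually captured text.
foreach ($matches as $match) {
    echo $match['src'], ' -> ', $match['dst'], "\n";
}
```

Unlike interpolating a PHP variable several times, the pattern text contains each definition only once, which is what shrinks the regex.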

mkulon answered Sep 30 '22 05:09


Have you tried array_chunk to split your array, and then running preg_match_all inside a foreach()? I was using the exact same code with an array of 40k+ elements. I went through the above solutions, but they didn't solve my "Compilation failed: regular expression is too large at offset" problem. I then split my 40k+ array into 4 arrays of 1k elements, ran preg_match_all on each one inside a foreach(), and voila, it worked.
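Roughly, the approach looks like the sketch below. The function name, the chunk size, and the word-boundary pattern are assumptions for illustration; adjust them to whatever your own pattern looks like.

```php
<?php
// Sketch: build one smaller alternation per chunk instead of one huge regex.
// matchKeywordsInChunks() is a hypothetical helper, not part of PHP.
function matchKeywordsInChunks(array $keywords, string $subject, int $chunkSize = 1000): array
{
    $found = [];
    foreach (array_chunk($keywords, $chunkSize) as $chunk) {
        // Escape every keyword so it is matched literally, then join with |.
        $alternatives = implode('|', array_map(function ($keyword) {
            return preg_quote($keyword, '/');
        }, $chunk));

        if (preg_match_all('/\b(?:' . $alternatives . ')\b/', $subject, $m)) {
            $found = array_merge($found, $m[0]);
        }
    }
    return $found;
}

$result = matchKeywordsInChunks(
    ['alpha', 'beta', 'gamma', 'delta'],
    'beta and delta, not epsilon',
    2
);
```

The results come back grouped by chunk rather than in the order they appear in the subject, so sort them afterwards if order matters.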

simran answered Sep 30 '22 03:09


I don't disagree with the comments that there might be a better way to do this, but I will answer the question here.

You can increase the maximum size of the regex, but only by recompiling PHP yourself. Because of this, your code will not be portable, and if you are using pre-compiled binaries, you are out of luck.

That said, I would suggest finding an alternative for matching.

See the comments in pcre_internal.h:

PCRE keeps offsets in its compiled code as 2-byte quantities (always stored in big-endian order) by default. These are used, for example, to link from the start of a subpattern to its alternatives and its end. The use of 2 bytes per offset limits the size of the compiled regex to around 64K, which is big enough for almost everybody. However, I received a request for an even bigger limit. For this reason, and also to make the code easier to maintain, the storing and loading of offsets from the byte string is now handled by the macros that are defined here.

The macros are controlled by the value of LINK_SIZE. This defaults to 2 in the config.h file, but can be overridden by using -D on the command line. This is automated on Unix systems via the "configure" command.

So you can either edit ext/pcre/pcrelib/config.h in the PHP source distribution to increase the size limit, or pass the define to the compiler when building, for example CFLAGS="-DLINK_SIZE=4" ./configure
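To put the 2-byte limit from the quoted comment in numbers, here is a small arithmetic sketch. It is not a PCRE API: maxLinkOffset is a hypothetical helper that only counts how many distinct offsets fit in LINK_SIZE bytes, and PCRE may impose a lower cap in practice. It assumes a 64-bit PHP build so the 4-byte value fits in an int.

```php
<?php
// Number of distinct offsets representable in $linkSize bytes (8 bits each).
// Hypothetical helper for illustration only.
function maxLinkOffset(int $linkSize): int
{
    return 2 ** (8 * $linkSize);
}

// 2, 3 and 4 are the link sizes PCRE accepts for its 8-bit library.
foreach ([2, 3, 4] as $linkSize) {
    printf(
        "LINK_SIZE=%d -> at most %s addressable bytes\n",
        $linkSize,
        number_format(maxLinkOffset($linkSize))
    );
}
```

With the default of 2 this gives 65,536, which is the "around 64K" mentioned in the comment.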

EDIT: If you are trying to match or parse HTML, I would recommend using DOMDocument to parse the HTML and then either walking the DOM tree or building an XPath query to find what you're looking for.
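A minimal sketch of that approach, assuming a made-up HTML fragment and XPath query (neither comes from the question):

```php
<?php
// Hypothetical markup, for illustration only.
$html = '<html><body><div class="post"><a href="/a">First</a></div>'
      . '<div class="post"><a href="/b">Second</a></div></body></html>';

// Collect libxml warnings internally instead of emitting them for imperfect markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Query the parsed tree instead of running a regex over the raw HTML.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//div[@class="post"]/a') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}
```

Here $links ends up mapping each href to its link text, and the query string is the only part that needs to change as the extraction rules evolve.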

drew010 answered Sep 30 '22 04:09