Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there such a thing as a hard limit for regex matching?

Tags:

regex

php

pcre

I have this code I wrote for a routing component, here is the full code, which basically builds a batch of any number of large regexes (depending on a configuration value) out of many small (route-specific) ones in order to validate if a given request can or cannot be mapped by a registered route.

The general process involves first decorating each route, so that routes with placeholders like

/path/to/:variable_name

turn into a regex like like:

/path/to/(?P<R1V1>[^/]+)

And at verification time, this routes get glued together in a single regex per batch of arbitrary size.

For 'normal' usage, say, 500 routes with 1-4 placeholders, it works pretty well. But while benchmarking it, I noticed that for an extremely large number of placeholders AND and extremely large amount of routes (currently 11 placeholders and 50000 routes) the code I have fails at finding the last registered route.

I cannot figure out why. To the best of my knowledge, the thing should behave roughly in the same manner, taking (maybe? my O notation is rusty) O(n*m) times per order of magnitude (n being the amount of batches and m being how many regexes are there in a batch).

Maybe it's something in the way I'm testing this? If it's not, could you please point me to any problematic thing I'm doing?

If it is of any use, the benchmark I'm using is this exact one here

<?php
require dirname(__FILE__).'/../vendor/autoload.php';
$router = new \CFV\Router();
$dispatcher = new \CFV\Dispatcher();

$dispatcher::$ROUTES_PER_LOT = 20;
// $dispatcher::$THROW_ON_FAIL = true;

$callback = function (){};
$num_args = 11;
$routes_amount = 50000;
$matches_amount = 1;
$args = implode('/', array_map(function($i){ return ':arg' . $i; }, range(1, $num_args)));
$params = implode('/', array_map(function($i){ return '_arg' . $i; }, range(1, $num_args)));
$last_tried = '';

$load_start = microtime(true);
for ($i = 0, $str = 'a'; $i < $routes_amount; $i++, $str++) {
    $router->connect("/$str/$args", $callback);
    $last_tried = "/$str/$params";
}
printf("Took: %fs to load all\n", microtime(true) - $load_start);

$dispatcher->setRouter($router);
$search_start = microtime(true);

$found = $dispatcher->dispatch($last_tried);

printf("Took: %fs searching all\n", microtime(true) - $search_start);

Any pointers whatsoever would be just great.

like image 754
Carlos Vergara Avatar asked May 04 '14 18:05

Carlos Vergara


1 Answers

There are some limits. From the manual:

The maximum length of a subject string is the largest positive number that an integer variable can hold. However, PCRE uses recursion to handle subpatterns and indefinite repetition. This means that the available stack space may limit the size of a subject string that can be processed by certain patterns.

Also, there are PCRE's recursion limit and backtrack limit

If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion, which it does by using pcre_malloc, freeing it via pcre_free afterwards. If no memory can be obtained, it saves data for the first 15 capturing parentheses only, as there is no way to give an out-of-memory error from within a recursion.

Ref:
http://php.net/manual/en/regexp.reference.recursive.php
http://php.net/pcre.configuration#ini.pcre.recursion-limit
http://php.net/pcre.configuration#ini.pcre.backtrack-limit

like image 177
eyecatchUp Avatar answered Nov 07 '22 19:11

eyecatchUp