Any way to treat .* as .{0,1024} in perl RE?

We allow some user-supplied REs for the purpose of filtering email. Early on we ran into some performance issues with REs that contained, for example, .*, when matching against arbitrarily-large emails. We found a simple solution was to s/\*/{0,1024}/ on the user-supplied RE. However, this is not a perfect solution, as it will break with the following pattern:

/[*]/

And rather than coming up with some convoluted recipe to account for every possible mutation of user-supplied RE input, I'd like to just limit perl's interpretation of the * and + characters to have a maximum length of 1024 characters.

Is there any way to do this?

Flimzy asked Dec 15 '11 09:12


4 Answers

This does not really answer your question, but you should be aware of other issues with user-supplied regular expressions; see, for example, this summary at OWASP. Depending on your exact situation, it might be better to write or find a simple custom pattern-matching library.

zoul answered Oct 17 '22 06:10


Update

I added a (?<!\\) before the quantifiers, because an escaped * or + should not be replaced. The replacement will still fail on \\* (match \ zero or more times): the * there is a real quantifier, but it is preceded by a backslash and so gets skipped.

An improvement would be this:

s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/g
s/(?<!\\)\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/g

See it here on Regexr

That means: match * or +, but only if it is not followed by a closing ] with no [ in between, i.e. only if it is not inside a character class. The (?<!\\) parts make sure that the square brackets themselves are not escaped with a \.

(?! ... ) is a negative lookahead

(?<! ... ) is a negative lookbehind

See perlretut for details

Update 2: include possessive quantifiers

s/(?<!(?<!\\)[\\+*?])\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/g   # for +
s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/g               # for *

See it here on Regexr

Seems to be working, but it's getting really complicated now!
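
A quick way to sanity-check the Update 2 substitutions is to run them over a handful of patterns. This is only a sketch: the helper name bound_quantifiers and the sample patterns are made up for illustration, and the /g flag is used so that every occurrence in a pattern is rewritten rather than only the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the two "Update 2" substitutions to a
# user-supplied pattern string and return the rewritten pattern.
sub bound_quantifiers {
    my ($re) = @_;
    # "+" first, so that the "+" of a possessive "*+" is judged against
    # the original text and left alone.
    $re =~ s/(?<!(?<!\\)[\\+*?])\+(?!(?<!\\)[^[]*?(?<!\\)\])/{1,1024}/g;
    $re =~ s/(?<!\\)\*(?!(?<!\\)[^[]*?(?<!\\)\])/{0,1024}/g;
    return $re;
}

# Plain quantifiers, a character class, an escaped star, a possessive star.
for my $re ('a.*b', 'x.+y', '[*]', 'a\*b', '.*+z') {
    print "$re  =>  ", bound_quantifiers($re), "\n";
}
```

With these inputs, [*] and a\*b come back unchanged, while .*+z keeps its possessive + after the rewritten quantifier. The \\* case mentioned in the first update is still left untouched by this approach.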

stema answered Oct 17 '22 04:10


Get a parse tree using Regexp::Parser and modify the regex as you want, or provide a GUI interface to Regexp::English.
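
The answer does not show any Regexp::Parser code, so rather than guess at that module's API, here is a hand-rolled sketch of the same idea: split the pattern into tokens first, then rewrite only the tokens that really are unbounded quantifiers. The names $TOKEN and cap_quantifiers are hypothetical. The tokenizer is deliberately small: it assumes the user-supplied pattern contains no /x comments or \Q...\E, and it does not validate the pattern.

```perl
use strict;
use warnings;

# One token of a pattern. Escapes and whole bracketed classes are single
# tokens, so a * or + inside them is never treated as a quantifier.
my $TOKEN = qr/\G(?:
      (?<escape> \\. )
    | (?<class>  \[ \^? \]? (?: \\. | \[:\^?\w+:\] | [^\]\\] )*+ \] )
    | (?<open>   \{ (?<min>\d+) , \} )
    | (?<fixed>  \{ \d+ (?:,\d+)? \} )
    | (?<star>   \* )
    | (?<plus>   \+ )
    | (?<opt>    \? )
    | (?<start>  [(|^] )
    | (?<other>  . )
)/xs;

# Hypothetical helper: give *, + and {n,} an upper bound of $max.
sub cap_quantifiers {
    my ($re, $max) = @_;
    $max = 1024 unless defined $max;

    my $out  = '';
    my $prev = 'none';    # kind of the previous token: none, atom or quant

    while ($re =~ /$TOKEN/gc) {
        my $text = $&;
        my $quantifiable = $prev eq 'atom';

        if (defined $+{star} && $quantifiable) {
            $out .= "{0,$max}";
            $prev = 'quant';
        }
        elsif (defined $+{plus} && $quantifiable) {
            $out .= "{1,$max}";
            $prev = 'quant';
        }
        elsif (defined $+{open} && $quantifiable) {
            my $min = $+{min};
            $out .= sprintf '{%d,%d}', $min, ($min > $max ? $min : $max);
            $prev = 'quant';
        }
        elsif ((defined $+{fixed} || defined $+{opt}) && $quantifiable) {
            $out .= $text;    # already bounded: copy as-is
            $prev = 'quant';
        }
        elsif (defined $+{start} || defined $+{star}
            || defined $+{plus}  || defined $+{opt}) {
            # Group opener, alternation, anchor, or a lazy/possessive
            # modifier following a quantifier: copy verbatim.
            $out .= $text;
            $prev = 'none';
        }
        else {
            $out .= $text;    # escape, class, literal, closing paren, ...
            $prev = 'atom';
        }
    }
    return $out;
}

for my $re ('a.*b', '[*]+', 'a\\\\*', '.*+z', 'x{2,}') {
    print "$re  =>  ", cap_quantifiers($re), "\n";
}
```

Because the escape \\ is consumed as one token, the * after it is recognised as a real quantifier, which is the case the lookbehind-based substitutions miss. An open-ended {n,} is capped as well, since it is just another way of writing an unbounded repeat.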

jabberwocky answered Oct 17 '22 06:10


You mean other than patching the source?

  1. You can break the input text into shorter chunks and match only those. But then again, you wouldn't match across a chunk ("line") break.
  2. You can split the regex: search only for its first character, load the next 1024 characters of text, and then match the whole regex against that. (Obviously, that doesn't work with a regex starting with .)
  3. Find the first character of the regex that is not one of .*+()\, search for it, load 1024 characters before and after it, and then match the whole regex against this string. (Complicated and prone to errors with strange, unforeseen regexes.)
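
Option 1 can be sketched with overlapping windows, which softens (but does not remove) the chunk-boundary problem. The helper name match_in_windows and the default sizes are assumptions for illustration only. Note that anchors such as ^ and $ would apply to each window rather than to the whole message.

```perl
use strict;
use warnings;

# Hypothetical helper: try $re against overlapping windows of $text instead
# of the whole text, so no single match attempt sees more than $size chars.
sub match_in_windows {
    my ($text, $re, $size, $overlap) = @_;
    $size    = 1024 unless defined $size;
    $overlap = 128  unless defined $overlap;

    my $step = $size - $overlap;
    die "overlap must be smaller than the window size\n" if $step <= 0;

    for (my $pos = 0; $pos < length($text); $pos += $step) {
        my $window = substr($text, $pos, $size);
        return 1 if $window =~ $re;
    }
    return 0;    # no window matched
}

my $mail = ('x' x 3000) . 'needle' . ('y' x 3000);

# A short match is found in one of the windows.
print match_in_windows($mail, qr/needle/) ? "found\n" : "not found\n";

# A match longer than one window is missed, even though the whole text
# would match: this is the trade-off described in option 1.
print match_in_windows($mail, qr/x.{4000}/s) ? "found\n" : "not found\n";
```

A match that is longer than the overlap can still be missed when it straddles two windows, so the overlap would have to be chosen with the expected match lengths in mind.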

Nikodemus RIP answered Oct 17 '22 05:10