Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding a string *and* its substrings in a haystack

Suppose you have a string (e.g. needle). Its 19 continuous substrings are:

needle
needl eedle
need eedl edle
nee eed edl dle
ne ee ed dl le
n e d l

If I were to build a regex to match, in a haystack, any of the substrings I could simply do:

/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle|ne|ee|ed|dl|le|n|e|d|l)/

but it doesn't look really elegant. Is there a better way to create a regex that will greedly match any one of the substrings of a given string?

Additionally, what if I posed another constraint, wanted to match only substrings longer than a threshold, e.g. for substrings of at least 3 characters:

/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle)/

note: I deliberately did not mention any particular regex dialect. Please state which one you're using in your answer.

like image 575
CAFxX Avatar asked Jan 21 '12 17:01

CAFxX


3 Answers

As Qtax suggested, the expression

n(e(e(d(l(e)?)?)?)?)?|e(e(d(l(e)?)?)?)?|e(d(l(e)?)?)?|d(l(e)?)?|l(e)?|e

would be the way to go if you wanted to write an explicit regular expression (egrep syntax, optionally replace (...) by (?:...)). The reason why this is better than the initial solution is that the condensed version requires only O(n^2) space compared to O(n^3) space in the original version, where n is the length of the input. Try this with extraordinarily as input to see the difference. I guess the condensed version is also faster with many regexp engines out there.

The expression

nee(d(l(e)?)?)?|eed(l(e)?)?|edl(e)?|dle

will look for substrings of length 3 or longer.

As pointed out by vhallac, the generated regular expressions are a bit redundant and can be optimized. Apart from the proposed Emacs tool, there is a Perl package Regexp::Optimizer that I hoped would help here, but a quick check failed for the first regular expression.

Note that many regexp engines perform non-overlapping search by default. Check this with the requirements of your problem.

like image 64
krlmlr Avatar answered Nov 09 '22 03:11

krlmlr


I have found elegant almostsolution, depending how badly you need only one regexp. For example here is the regexp, which finds common substring (perl) of length 7:

"$needle\0$heystack" =~ /(.{7}).*?\0.*\1/s

Matching string is in \1. Strings should not contain null character which is used as separator.

You should make a cycle which starters with length of the needle and goes downto treshold and tries to match the regexp.

like image 33
gorn Avatar answered Nov 09 '22 05:11

gorn


Is there a better way to create a regex that will match any one of the substrings of a given string?

No. But you can generate such expression easily.

like image 26
Qtax Avatar answered Nov 09 '22 03:11

Qtax