Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combinatorial product of regex substitutions

I am trying to produce string variants by applying substitutions optionally.

For example, one substitution scheme is removing any sequence of blank characters. Rather than replacing all occurrences like

>>> re.sub(r'\s+', '', 'a b c')
'abc'

– I need, instead, two variants to be produced for each occurrence, in that the substitution is performed in one variant, but not in the other. For the string 'a b c' I want to have the variants

['a b c', 'a bc', 'ab c', 'abc']

ie. the cross product of all binary decisions (the result obviously includes the original string).

For this case, the variants can be produced using re.finditer and itertools.product:

def vary(target, pattern, subst):
    occurrences = [m.span() for m in pattern.finditer(target)]
    for path in itertools.product((True, False), repeat=len(occurrences)):
        variant = ''
        anchor = 0
        for (start, end), apply_this in zip(occurrences, path):
            if apply_this:
                variant += target[anchor:start] + subst
                anchor = end
        variant += target[anchor:]
        yield variant

This produces the desired output for the above example:

>>> list(vary('a b c', re.compile(r'\s+'), ''))
['abc', 'ab c', 'a bc', 'a b c']

However, this solution only works for fixed-string replacements. Advanced features from re.sub like group references can't be done like that, as in the following example for inserting a space after a sequence of digits inside a word:

re.sub(r'\B(\d+)\B'), r'\1 ', 'abc123def')

How can the approach be extended or changed to accept any valid argument to re.sub (without writing a parser for interpreting group references)?

like image 471
lenz Avatar asked Jan 06 '16 16:01

lenz


1 Answers

Thinking about making subst a callable that gets access to match data finally made me learn about MatchObject.expand. So, as an approximation, with subst staying an r string,

def vary(target, pattern, subst):
    matches = [m for m in pattern.finditer(target)]
    occurrences = [m.span() for m in matches]
    for path in itertools.product((True, False), repeat=len(occurrences)):
        variant = ''
        anchor = 0
        for match, (start, end), apply_this in zip(matches, occurrences, path):
            if apply_this:
                variant += target[anchor:start] + match.expand(subst)
                anchor = end
        variant += target[anchor:]
        yield variant

I am not sure, though, that this covers all needed flexibility in referring to the subject string, being bount to the corresponding match. An indexed power set of the split string came to mind, but I guess that's not far from the parser mentioned.

like image 112
B98 Avatar answered Sep 30 '22 17:09

B98