I am trying to produce string variants by applying substitutions optionally.
For example, one substitution scheme is removing any sequence of blank characters. Rather than replacing all occurrences like
>>> re.sub(r'\s+', '', 'a b c')
'abc'
– I need, instead, two variants to be produced for each occurrence, in that the substitution is performed in one variant, but not in the other.
For the string 'a b c'
I want to have the variants
['a b c', 'a bc', 'ab c', 'abc']
ie. the cross product of all binary decisions (the result obviously includes the original string).
For this case, the variants can be produced using re.finditer
and itertools.product
:
def vary(target, pattern, subst):
occurrences = [m.span() for m in pattern.finditer(target)]
for path in itertools.product((True, False), repeat=len(occurrences)):
variant = ''
anchor = 0
for (start, end), apply_this in zip(occurrences, path):
if apply_this:
variant += target[anchor:start] + subst
anchor = end
variant += target[anchor:]
yield variant
This produces the desired output for the above example:
>>> list(vary('a b c', re.compile(r'\s+'), ''))
['abc', 'ab c', 'a bc', 'a b c']
However, this solution only works for fixed-string replacements.
Advanced features from re.sub
like group references can't be done like that,
as in the following example for inserting a space after a sequence of digits inside a word:
re.sub(r'\B(\d+)\B'), r'\1 ', 'abc123def')
How can the approach be extended or changed to accept any valid argument to re.sub (without writing a parser for interpreting group references)?
Thinking about making subst
a callable that gets access to match data finally made me learn about MatchObject.expand
. So, as an approximation, with subst
staying an r
string,
def vary(target, pattern, subst):
matches = [m for m in pattern.finditer(target)]
occurrences = [m.span() for m in matches]
for path in itertools.product((True, False), repeat=len(occurrences)):
variant = ''
anchor = 0
for match, (start, end), apply_this in zip(matches, occurrences, path):
if apply_this:
variant += target[anchor:start] + match.expand(subst)
anchor = end
variant += target[anchor:]
yield variant
I am not sure, though, that this covers all needed flexibility in referring to the subject string, being bount to the corresponding match. An indexed power set of the split string came to mind, but I guess that's not far from the parser mentioned.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With