Recomputing substring counts updating after replacing it with 'X'

Question

Given a string:

s = 'cdababef'

We calculate the character before and character after with:

def per_window(sequence, n=1):
    """
    From http://stackoverflow.com/q/42220614/610569
        >>> list(per_window([1,2,3,4], n=2))
        [(1, 2), (2, 3), (3, 4)]
        >>> list(per_window([1,2,3,4], n=3))
        [(1, 2, 3), (2, 3, 4)]
    """
    start, stop = 0, n
    seq = list(sequence)
    while stop <= len(seq):
        yield tuple(seq[start:stop])
        start += 1
        stop += 1

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1

[out]:

>>> char_after
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('b', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('c', 'd'): Counter({'a': 1}),
             ('d', 'a'): Counter({'b': 1})})

>>> char_before
defaultdict(collections.Counter,
            {('a', 'b'): Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('e', 'f'): Counter({'b': 1})})

Let's say if I replace all instances of ab with x, and I need to update the the char_after and char_before counts and the goal is to achieve without re-counting of all the substring of s = 'cdxxef' like:

s = 'cdxxef'
char_before2 = defaultdict(Counter)
char_after2 = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after2[window[:2]][window[2]] += 1
    char_before2[window[1:]][window[0]] += 1

[desired outputs]:

>>> char_before2
defaultdict(collections.Counter,
            {('d', 'x'): Counter({'c': 1}),
             ('e', 'f'): Counter({'x': 1}),
             ('x', 'e'): Counter({'x': 1}),
             ('x', 'x'): Counter({'d': 1})})

>>> char_after2
defaultdict(collections.Counter,
            {('c', 'd'): Counter({'x': 1}),
             ('d', 'x'): Counter({'x': 1}),
             ('x', 'e'): Counter({'f': 1}),
             ('x', 'x'): Counter({'e': 1})})

How can the updates of the substring be done without re-counting of all the substring but only the substrings affected by the replacements?

I've tried:

s = 'cdababef'

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, 3):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1

source, target = ('a', 'b'), 'x'
for ch in char_before[source]:
    count_before = char_before[source][ch]
    char_before[target][ch] += count_before
    char_before[source][ch] = 0

    count_after = char_after[source][ch]
    char_after[target][ch] += count_after
    char_before[source][ch] = 0

But the output is not the desired one as with char_before2 and char_after2:

>>> char_before
defaultdict(collections.Counter,
            {'x': Counter({'b': 1, 'd': 1}),
             ('b', 'a'): Counter({'a': 1}),
             ('d', 'a'): Counter({'c': 1}),
             ('b', 'e'): Counter({'a': 1}),
             ('a', 'b'): Counter({'b': 0, 'd': 0}),
             ('e', 'f'): Counter({'b': 1})})

>>> char_after
defaultdict(collections.Counter,
            {'x': Counter({'b': 0, 'd': 0}),
             ('b', 'a'): Counter({'b': 1}),
             ('d', 'a'): Counter({'b': 1}),
             ('b', 'e'): Counter({'f': 1}),
             ('a', 'b'): Counter({'a': 1, 'e': 1}),
             ('c', 'd'): Counter({'a': 1})})

bunji · Accepted Answer

Here is an approach that tackles this problem in three steps:

identify the substrings affected by the substitutions
remove the counts from the char_before and char_after dictionaries that are in these substrings
do the substitution and add the new counts to the char_before and char_after dictionaries for the affected substrings only

First let's start by defining some variables and running your initial code.

source, target = ('a', 'b'), 'x'
n = 3

char_before= defaultdict(Counter)
char_after = defaultdict(Counter) 
for window in per_window(s, n):
    char_after[window[:2]][window[2]] += 1
    char_before[window[1:]][window[0]] += 1

Now we find the spans (beginning and ending indexes) of the substrings that will be replaced (Note that we're not actually doing any replacement yet)

import re 

spans = [m.span() for m in re.finditer(''.join(source), s)]

But we know that the before and after counts of the windows that fall into one of these spans are not the only ones that will be affected by the replacement. Any window directly preceding or following one of these spans will also be affected. For example, in s = 'cdababef' if we replace 'ab' with 'x' the initial 'cd' will need it's char_after counts updated even though no part of 'cd' itself is replaced.

To handle this we define a function called merge_spans that not only merges adjacent spans ((2,4) and (4,6) becomes (2,6)) but also merges spans that are within extra spaces from each other (where extra is an integer defined by a keyword argument). The intuition behind this is that this will return a list of spans where the spans correspond to all the substrings who's before/after counts will be affected by the replacement.

def merge_spans(spans, extra = 0):
    extra = max(0,extra)
    merged = spans[:]
    if len(merged) == 1:
        return [(max(merged[0][0]-extra, 0), merged[0][-1]+extra)]
    for i in range(1, len(merged)):
        span = merged[i]
        prev = merged[i-1]
        if prev[-1]+extra >= span[0]-extra:
            merged[i] = (max(0,prev[0]-extra), span[-1]+extra)
            merged[i-1] = ()
        elif i == len(merged)-1:
            merged[i] = (max(0,span[0]-extra), span[-1]+extra)
            merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
        else:
            merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
    return list(filter(None, merged))

So lets create this list of spans. We set extra to n-1 since n-1 letters on either side of a replacement will be affected.

merged = merge_spans(spans, n-1)

Now we can iterate over these spans and remove the counts for the windows that are affected by the substitution. Then we can do the substitution just within that span and update the counts.

for span in merged:
    sub_s = s[span[0]:span[-1]]
    for window in per_window(sub_s, n):
        char_after[window[:2]][window[2]] -= 1
        char_before[window[1:]][window[0]] -= 1
    new_s = sub_s.replace(''.join(source), target)
    for window in per_window(new_s, n):
        char_after[window[:2]][window[2]] += 1
        char_before[window[1:]][window[0]] += 1

Note that the above will affect the original char_before and char_after dictionaries but you can always make copies of them first if you need to keep the original counts for some reason.

Finally, we remove any counts from the counter that are 0 or negative and completely remove any windows that contain no positive counts at all. Note that adding Counter() to a counter removes any elements who's value is not positive.

char_before2 = {k:v+Counter() for k,v in char_before.items() if any(v.values())}
char_after2 = {k:v+Counter() for k,v in char_after.items() if any(v.values())}

And the result:

>>> char_before2
{('d', 'x'): Counter({'c': 1}),
 ('e', 'f'): Counter({'x': 1}),
 ('x', 'e'): Counter({'x': 1}),
 ('x', 'x'): Counter({'d': 1})}

>>> char_after2 
{('c', 'd'): Counter({'x': 1}),
 ('d', 'x'): Counter({'x': 1}),
 ('x', 'e'): Counter({'f': 1}),
 ('x', 'x'): Counter({'e': 1})}

Recomputing substring counts updating after replacing it with 'X'

Tags:

python

string

substring

replace

counter

alvas

1 Answers

bunji

Recent Activity

Donate For Us

Recomputing substring counts updating after replacing it with 'X'

Tags:

python

string

substring

replace

counter

alvas

1 Answers

bunji

Related questions

Recent Activity

Donate For Us