Given a string:
s = 'cdababef'
We calculate the character before and character after with:
def per_window(sequence, n=1):
"""
From http://stackoverflow.com/q/42220614/610569
>>> list(per_window([1,2,3,4], n=2))
[(1, 2), (2, 3), (3, 4)]
>>> list(per_window([1,2,3,4], n=3))
[(1, 2, 3), (2, 3, 4)]
"""
start, stop = 0, n
seq = list(sequence)
while stop <= len(seq):
yield tuple(seq[start:stop])
start += 1
stop += 1
char_before= defaultdict(Counter)
char_after = defaultdict(Counter)
for window in per_window(s, 3):
char_after[window[:2]][window[2]] += 1
char_before[window[1:]][window[0]] += 1
[out]:
>>> char_after
defaultdict(collections.Counter,
{('a', 'b'): Counter({'a': 1, 'e': 1}),
('b', 'a'): Counter({'b': 1}),
('b', 'e'): Counter({'f': 1}),
('c', 'd'): Counter({'a': 1}),
('d', 'a'): Counter({'b': 1})})
>>> char_before
defaultdict(collections.Counter,
{('a', 'b'): Counter({'b': 1, 'd': 1}),
('b', 'a'): Counter({'a': 1}),
('b', 'e'): Counter({'a': 1}),
('d', 'a'): Counter({'c': 1}),
('e', 'f'): Counter({'b': 1})})
Let's say if I replace all instances of ab
with x
, and I need to update the the char_after
and char_before
counts and the goal is to achieve without re-counting of all the substring of s = 'cdxxef'
like:
s = 'cdxxef'
char_before2 = defaultdict(Counter)
char_after2 = defaultdict(Counter)
for window in per_window(s, 3):
char_after2[window[:2]][window[2]] += 1
char_before2[window[1:]][window[0]] += 1
[desired outputs]:
>>> char_before2
defaultdict(collections.Counter,
{('d', 'x'): Counter({'c': 1}),
('e', 'f'): Counter({'x': 1}),
('x', 'e'): Counter({'x': 1}),
('x', 'x'): Counter({'d': 1})})
>>> char_after2
defaultdict(collections.Counter,
{('c', 'd'): Counter({'x': 1}),
('d', 'x'): Counter({'x': 1}),
('x', 'e'): Counter({'f': 1}),
('x', 'x'): Counter({'e': 1})})
How can the updates of the substring be done without re-counting of all the substring but only the substrings affected by the replacements?
I've tried:
s = 'cdababef'
char_before= defaultdict(Counter)
char_after = defaultdict(Counter)
for window in per_window(s, 3):
char_after[window[:2]][window[2]] += 1
char_before[window[1:]][window[0]] += 1
source, target = ('a', 'b'), 'x'
for ch in char_before[source]:
count_before = char_before[source][ch]
char_before[target][ch] += count_before
char_before[source][ch] = 0
count_after = char_after[source][ch]
char_after[target][ch] += count_after
char_before[source][ch] = 0
But the output is not the desired one as with char_before2
and char_after2
:
>>> char_before
defaultdict(collections.Counter,
{'x': Counter({'b': 1, 'd': 1}),
('b', 'a'): Counter({'a': 1}),
('d', 'a'): Counter({'c': 1}),
('b', 'e'): Counter({'a': 1}),
('a', 'b'): Counter({'b': 0, 'd': 0}),
('e', 'f'): Counter({'b': 1})})
>>> char_after
defaultdict(collections.Counter,
{'x': Counter({'b': 0, 'd': 0}),
('b', 'a'): Counter({'b': 1}),
('d', 'a'): Counter({'b': 1}),
('b', 'e'): Counter({'f': 1}),
('a', 'b'): Counter({'a': 1, 'e': 1}),
('c', 'd'): Counter({'a': 1})})
Here is an approach that tackles this problem in three steps:
char_before
and char_after
dictionaries that are in these substringschar_before
and char_after
dictionaries for the affected substrings only First let's start by defining some variables and running your initial code.
source, target = ('a', 'b'), 'x'
n = 3
char_before= defaultdict(Counter)
char_after = defaultdict(Counter)
for window in per_window(s, n):
char_after[window[:2]][window[2]] += 1
char_before[window[1:]][window[0]] += 1
Now we find the spans (beginning and ending indexes) of the substrings that will be replaced (Note that we're not actually doing any replacement yet)
import re
spans = [m.span() for m in re.finditer(''.join(source), s)]
But we know that the before and after counts of the windows that fall into one of these spans are not the only ones that will be affected by the replacement. Any window directly preceding or following one of these spans will also be affected. For example, in s = 'cdababef'
if we replace 'ab'
with 'x'
the initial 'cd'
will need it's char_after
counts updated even though no part of 'cd'
itself is replaced.
To handle this we define a function called merge_spans
that not only merges adjacent spans ((2,4)
and (4,6)
becomes (2,6)
) but also merges spans that are within extra
spaces from each other (where extra
is an integer defined by a keyword argument). The intuition behind this is that this will return a list of spans where the spans correspond to all the substrings who's before/after counts will be affected by the replacement.
def merge_spans(spans, extra = 0):
extra = max(0,extra)
merged = spans[:]
if len(merged) == 1:
return [(max(merged[0][0]-extra, 0), merged[0][-1]+extra)]
for i in range(1, len(merged)):
span = merged[i]
prev = merged[i-1]
if prev[-1]+extra >= span[0]-extra:
merged[i] = (max(0,prev[0]-extra), span[-1]+extra)
merged[i-1] = ()
elif i == len(merged)-1:
merged[i] = (max(0,span[0]-extra), span[-1]+extra)
merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
else:
merged[i-1] = (max(0,prev[0]-extra), prev[-1]+extra)
return list(filter(None, merged))
So lets create this list of spans. We set extra
to n-1
since n-1
letters on either side of a replacement will be affected.
merged = merge_spans(spans, n-1)
Now we can iterate over these spans and remove the counts for the windows that are affected by the substitution. Then we can do the substitution just within that span and update the counts.
for span in merged:
sub_s = s[span[0]:span[-1]]
for window in per_window(sub_s, n):
char_after[window[:2]][window[2]] -= 1
char_before[window[1:]][window[0]] -= 1
new_s = sub_s.replace(''.join(source), target)
for window in per_window(new_s, n):
char_after[window[:2]][window[2]] += 1
char_before[window[1:]][window[0]] += 1
Note that the above will affect the original char_before
and char_after
dictionaries but you can always make copies of them first if you need to keep the original counts for some reason.
Finally, we remove any counts from the counter that are 0
or negative and completely remove any windows that contain no positive counts at all. Note that adding Counter()
to a counter removes any elements who's value is not positive.
char_before2 = {k:v+Counter() for k,v in char_before.items() if any(v.values())}
char_after2 = {k:v+Counter() for k,v in char_after.items() if any(v.values())}
And the result:
>>> char_before2
{('d', 'x'): Counter({'c': 1}),
('e', 'f'): Counter({'x': 1}),
('x', 'e'): Counter({'x': 1}),
('x', 'x'): Counter({'d': 1})}
>>> char_after2
{('c', 'd'): Counter({'x': 1}),
('d', 'x'): Counter({'x': 1}),
('x', 'e'): Counter({'f': 1}),
('x', 'x'): Counter({'e': 1})}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With