Given a list of strings, where each string is in the format "A - something" or "B - somethingelse", and list items mostly alternate between pieces of "A" data and "B" data, how can irregularities be removed?
Example: A B A B A A B A B A B A B A B B A B A B A A B B A B A B
In this case, AAB (see rule 2), ABB (see rule 3) and AABB should be removed.
I'll give it a try with regexp returning indexes of sequences to be removed
>>> import re
>>> data = 'ABABAABABABABABBABABAABBABAB'
>>> [(m.start(0), m.end(0)) for m in re.finditer('(AA+B+)|(ABB+)', data)]
[(4, 7), (13, 16), (20, 24)]
or result of stripping
>>> re.sub('(AA+B+)|(ABB+)', '', data)
ABABABABABABABABAB
The drunk-on-itertools solution:
>>> s = 'ABABAABABABABABBABABAABBABAB'
>>> from itertools import groupby, takewhile, islice, repeat, chain
>>> groups = (list(g) for k,g in groupby(s))
>>> pairs = takewhile(bool, (list(islice(groups, 2)) for _ in repeat(None)))
>>> kept_pairs = (p for p in pairs if len(p[0]) == len(p[1]) == 1)
>>> final = list(chain(*chain(*kept_pairs)))
>>> final
['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
(Unfortunately I'm now in no shape to think about corner cases and trailing A
s etc..)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With