I know this is easy, I just can't do it! In this example, all I want to do is split the string below into chunks of identical adjacent letters, e.g. test = "AAATGG" would be split into "AAA", "T", "GG". I've been trying different ways; one attempt is below. I'd appreciate the help.
I know the idea is to go through the string: if the next letter is the same as the current letter, continue on; otherwise, break, print, and start again. I just can't implement it properly.
test = "AAATGG"
TestDict = {}
for index,i in enumerate(test[:-1]):
    string = ""
    if test[index] == test[index+1]:
        string = i + test[index]
    else:
        break
    print string
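For reference, the loop described in the question can be made to work roughly like this. It is only a sketch: the function name split_runs is made up for illustration, and it compares each letter to the chunk built so far rather than looking ahead to the next index, which avoids the special case for the last character.

```python
def split_runs(s):
    """Split s into chunks of identical adjacent characters."""
    chunks = []
    current = ""  # the chunk being built
    for ch in s:
        if current and ch != current[-1]:
            # The letter changed: store the finished chunk, start a new one.
            chunks.append(current)
            current = ch
        else:
            # Same letter as before (or the very first letter): extend the chunk.
            current += ch
    if current:
        # The last chunk is never followed by a change, so store it here.
        chunks.append(current)
    return chunks


print(split_runs("AAATGG"))  # ['AAA', 'T', 'GG']
```

The attempt in the question resets string to "" on every pass and breaks out of the loop at the first change of letter, so it can never collect more than one chunk.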
One way is to use groupby from itertools:
from itertools import groupby
[''.join(g) for _, g in groupby(test)]
# ['AAA', 'T', 'GG']
I'd probably just use itertools.groupby:
>>> import itertools as it
>>> s = 'AAATGG'
>>> for k, g in it.groupby(s):
... print(k, list(g))
...
('A', ['A', 'A', 'A'])
('T', ['T'])
('G', ['G', 'G'])
>>>
>>> # Multiple non-consecutive occurrences of a given value.
>>> s = 'AAATTGGAAA'
>>> for k, g in it.groupby(s):
... print(k, list(g))
...
('A', ['A', 'A', 'A'])
('T', ['T', 'T'])
('G', ['G', 'G'])
('A', ['A', 'A', 'A'])
As you can see, g becomes an iterable that yields all consecutive occurrences of the given character (k). I used list(g) to consume the iterable, but you could do anything you like with it (including ''.join(g) to get a string, or sum(1 for _ in g) to get the count).
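To make the two alternatives mentioned above concrete, here is a small sketch. The helper names runs and run_lengths are invented for this example; only groupby comes from the standard library.

```python
from itertools import groupby


def runs(s):
    # Join each group of consecutive equal characters back into a string.
    return [''.join(g) for _, g in groupby(s)]


def run_lengths(s):
    # Pair each character with the length of its consecutive run.
    return [(k, sum(1 for _ in g)) for k, g in groupby(s)]


print(runs("AAATTGGAAA"))         # ['AAA', 'TT', 'GG', 'AAA']
print(run_lengths("AAATTGGAAA"))  # [('A', 3), ('T', 2), ('G', 2), ('A', 3)]
```

Each g can only be consumed once, so pick one of the two per pass over the string (or convert g to a list first if you need both).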
You can use regex:
>>> import re
>>> re.findall(r'((\w)\2*)', test)
[('AAA', 'A'), ('T', 'T'), ('GG', 'G')]
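findall returns tuples here because the pattern has two capturing groups. If you only want the chunks, something like the following should work; the variable names are just for illustration. Note that \w matches letters, digits and underscore, so other characters would be skipped.

```python
import re

test = "AAATGG"

# Option 1: keep the two-group pattern and take the first item of each tuple.
chunks_findall = [full for full, _ in re.findall(r'((\w)\2*)', test)]

# Option 2: use finditer with a single group and read the whole match.
chunks_finditer = [m.group(0) for m in re.finditer(r'(\w)\1*', test)]

print(chunks_findall)   # ['AAA', 'T', 'GG']
print(chunks_finditer)  # ['AAA', 'T', 'GG']
```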
You could also use re.findall. In this case, I assumed only the letters A, T, C, and G are present.
import re
re.findall('(A+|T+|G+|C+)', test)
# ['AAA', 'T', 'GG']