I have string of some length consisting of only 4 characters which are 'A,T,G and C'. I have pattern 'GAATTC' present multiple times in the given string. I have to cut the string at intervals where this pattern is.. For example for a string, 'ATCGAATTCATA', I should get output of
ATCGAATTCATAI am newbie in using Python but I have come up with the following (incomplete) code:
seq = seq.upper()
str1 = "GAATTC"
seqlen = len(seq)
seq = list(seq)
for i in range(0,seqlen-1):
site = seq.find(str1)
print(site[0:(i+2)])
Any help would be really appreciated.
First lets develop your idea of using find, so you can figure out your mistakes.
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GAATTC"
split_at = 2
seqlen = len(seq)
i = 0
while i < seqlen:
site = seq.find(pattern, i)
if site != -1:
print(seq[i: site + split_at])
i = site + split_at
else:
print seq[i:]
break
Yet python string sports a powerful replace method that directly replaces fragments of string. The below snippet uses the replace method to insert separators when needed:
seq = 'ATCGAATTCATAATCGAATTCATAATCGAATTCATA'
seq = seq.upper()
pattern = "GA","ATTC"
pattern1 = ''.join(pattern) # 'GAATTC'
pattern2 = ' '.join(pattern) # 'GA ATTC'
splited_seq = seq.replace(pattern1, pattern2) # 'ATCGA ATTCATAATCGA ATTCATAATCGA ATTCATA'
print (splited_seq.split())
I believe it is more intuitive and should be faster then RE (which might have lower performance, depending on library and usage)
Here is a simple solution :
seq = 'ATCGAATTCATA'
seq_split = seq.upper().split('GAATTC')
result = [
(seq_split[i] + 'GA') if i % 2 == 0 else ('ATTC' + seq_split[i])
for i in range(len(seq_split)) if len(seq_split[i]) > 0
]
Result :
print(result)
['ATCGA', 'ATTCATA']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With