I am trying to write a script that will perform two functions, when provided with two strings:
1. Find the longest sequence of characters starting from pos[0]
that is the same in both strings
Seq1 = 'ATCCTTAGC'
Seq2 = 'ATCCAGCAATTC'
^^^^ Match from pos[0] to pos[3]
Pos: 0:3
Length: 4
Seq: ATCC
2. Find the longest run of characters that exists in both strings
Seq1 = 'TAGCTCCTTAGC' # Contains 'TCCTTA'
Seq2 = 'GCAGCCATCCTTA' # Contains 'TCCTTA'
^ No match at pos[0]
Pos1: 4:9
Pos2: 7:12
Length: 6
Seq: TCCTTA
To accomplish problem 1, I have the following:
#!/usr/bin/python
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
print("Upstream: %s\nDownstream: %s\n" % (upstream_seq, downstream_seq))
mh = 0
pos_count = 0
seq = ""
position =""
longest_hom=""
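# Walk both sequences from position 0 and stop at the first mismatch
for i in range(len(upstream_seq)):
    pos_count += 1
    if upstream_seq[i] == downstream_seq[i]:
        mh += 1
        seq += upstream_seq[i]
        position = pos_count
        longest_hom = mh
    else:
        mh = 0
        break
# Format inside the call so this runs on both Python 2 and Python 3
print("Pos: 0:%s\nLength: %s\nSeq: %s\n" % (position, longest_hom, seq))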
Upstream: ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC
Downstream: ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG
Pos: 0:5
Length: 5
Seq: ATACA
I'm having trouble with Problem 2. So far, I've considered aligning the two sequences with BioPython's pairwise2. However, in this case I only want perfect matches (no gaps, no extensions), and I only want to see the longest shared sequence, not a consensus, which is what I appear to get:
from Bio import pairwise2 as pw2
global_align = pw2.align.globalms(upstream_seq, downstream_seq, 3, -1, -.5, -.5)
print(global_align[0])
('ATACATT-G----GCC-TTGGCTTA-----G--ACTTAGATCTAG-----ACCTGAA----AATAACCTGCCGAAAA-GACC-CGCCCGACTGTTAATACTT-TACGCG-AG-GCT-CAC-C-T-TT--TTGT-TG----T---GCTCC--C-', 'ATACA--CGAAAAG-CGTT--CTT-TTTTTGCCACTT---T-T--TTTTTA--TG--TTTCAA-AA-C-G--GAAAATG---TCG--C--C-G----T-C--GT-CG-GGAGAG-TGC-CTCCTCTTAGTT-TAT-CAAATAAAGCT--TTCG', 151.0, 0, 153)
Question: How can I find the longest run of characters that exists in both strings?
Here's a shorter version of the code for Problem 1:
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
common_prefix = ''
for x, y in zip(upstream_seq, downstream_seq):
    if x == y:
        common_prefix += x
    else:
        break
print(common_prefix)
# ATACA
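The standard library can also do this comparison for you. The following is a minimal sketch, assuming you are happy to borrow os.path.commonprefix, which despite its name compares plain strings character by character rather than path components. The two short sequences are the ones from Problem 1 in the question:

```python
import os.path

seq1 = 'ATCCTTAGC'
seq2 = 'ATCCAGCAATTC'

# commonprefix accepts any list of strings, not only file paths
common_prefix = os.path.commonprefix([seq1, seq2])

print(common_prefix)
# ATCC
print(len(common_prefix))
# 4
```

Because the match always starts at position 0, the end position follows directly from the length of the prefix.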
The naive approach for Problem 2 is to generate the set of all substrings of each string, intersect the two sets, and take the longest element:
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'
def all_substrings(string):
    n = len(string)
    return {string[i:j+1] for i in range(n) for j in range(i, n)}
print(all_substrings('ABCA'))
# {'CA', 'BC', 'ABC', 'C', 'BCA', 'AB', 'A', 'B', 'ABCA'}
print(all_substrings(upstream_seq) & all_substrings(downstream_seq))
# {'AAAG', 'CA', 'A', 'AAC', 'TGTT', 'ACT', 'CTTAG', 'GCT', 'ATAC', 'AAAA', 'TTTA', 'AAT', 'GTGC', 'CTT', 'AAAAG', 'TTTG', 'CGAA', 'AA', 'CGAAAAG', 'GCC', 'ACA', 'TGCC', 'AAATAA', 'CTCC', 'TTTTT', 'CGCC', 'CAC', 'GAG', 'CTC', 'CGAAAA', 'ATC', 'TCA', 'GA', 'CGC', 'TGT', 'GT', 'GC', 'GAAA', 'ACTTT', 'AAG', 'TTTT', 'CT', 'AATA', 'TCC', 'CGAAA', 'GAA', 'GAAAAG', 'GTT', 'AG', 'TC', 'AAAAT', 'CC', 'TTT', 'AATAA', 'CTTTT', 'ACTT', 'TTA', 'CTTT', 'GCTT', 'GCCG', 'GTG', 'TACA', 'TT', 'GCG', 'TTTTTG', 'TAG', 'TTG', 'TTAG', 'AAATA', 'CTTTTT', 'AAAT', 'TAA', 'ACG', 'TG', 'GCCT', 'G', 'TAC', 'CCT', 'TCT', 'ATA', 'CTTA', 'CCG', 'CG', 'ATAA', 'GG', 'ATACA', 'AGA', 'TGC', 'C', 'T', 'AT', 'GAAAA', 'CGA', 'GAAAAT', 'TA', 'AC', 'AAA', 'TTTTG'}
print(max(all_substrings(upstream_seq) & all_substrings(downstream_seq), key=len))
# CGAAAAG
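If you want a more efficient approach, you should use a suffix tree: the set-based version above builds a quadratic number of substrings per string, whereas a generalized suffix tree solves the problem in linear time.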
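A full suffix tree is too long to show here. As a simpler middle ground, here is a dynamic-programming sketch. Note that it is not a suffix tree: it runs in time proportional to len(a) * len(b) but only keeps two rows of counts in memory, so it avoids materialising every substring. The function name longest_common_substring is my own choice for illustration:

```python
upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'

def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest run shared by a and b."""
    best_len = best_end_a = best_end_b = 0
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a, start=1):
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended one character earlier in both strings
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end_a, best_end_b = curr[j], i, j
        prev = curr
    return best_end_a - best_len, best_end_b - best_len, best_len

start_a, start_b, length = longest_common_substring(upstream_seq, downstream_seq)
print(upstream_seq[start_a:start_a + length])
# CGAAAAG
```

The returned start offsets are 0-based, so the inclusive end position in each string is start + length - 1.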
If you don't want to reinvent the wheel, you could simply use difflib.SequenceMatcher.find_longest_match from the standard library.
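A minimal sketch of the difflib route, using the same two sequences as above. Passing autojunk=False is my own precaution: it switches off the heuristic that can treat very frequent characters as junk in long inputs, which matters for a four-letter alphabet. The search bounds are passed explicitly so the call also works on Python versions where they are not optional:

```python
from difflib import SequenceMatcher

upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC'
downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG'

matcher = SequenceMatcher(None, upstream_seq, downstream_seq, autojunk=False)

# Search the full range of both strings; the result has fields a, b and size
match = matcher.find_longest_match(0, len(upstream_seq), 0, len(downstream_seq))

longest = upstream_seq[match.a:match.a + match.size]
print(longest)
# CGAAAAG

# 0-based inclusive positions in each sequence, in the question's Pos format
print("Pos1: %s:%s" % (match.a, match.a + match.size - 1))
print("Pos2: %s:%s" % (match.b, match.b + match.size - 1))
print("Length: %s" % match.size)
```

If several runs share the maximum length, find_longest_match reports the one that starts earliest in the first sequence.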