Programming language: Python 3.4
I have written a program for the Bioinformatics 1 course from Coursera. The program is working all right, but is very slow for large datasets. I guess, it is because the loop is running for 4**k times, where k is the length of the sub-string that is passed into the function. Input: Strings Text and Pattern along with an integer d. Output: All starting positions where Pattern appears as a substring of Text with at most d mismatches.
This is my code:
def MotifCount(string1, substring, d):
k = 4 ** (len(substring))
codeArray = list(itertools.product(['A', 'C', 'G', 'T'], repeat=len(substring)))
for i in range(k):
codeArray2 = ''.join(list(codeArray[i]))
HammingValue = HammingDistance(codeArray2, substring)
if HammingValue <= d:
for j in range(len(string1)):
if(string1.find(codeArray2, j) == j):
print(j)
def HammingDistance(string_1, string_2):
length_1 = len(string_1)
length_2 = len(string_2)
count = 0
for i in range(length_1):
if string_1[i] != string_2[i]:
count += 1
return count
Sample Input:
CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT
ATTCTGGA
3
Output:
6 7 26 27
I want to optimize this code for bigger data sets. Is there any way to reduce the run time of the code?
import itertools
def HammingDistance(string_1, string_2):
assert len(string_1) == len(string_2)
return sum(c1 != c2 for c1, c2 in zip(string_1, string_2))
def MotifCount(string1, substring, d):
for i in range(len(string1) - len(substring) + 1):
if HammingDistance(string1[i:i+len(substring)], substring) <= d:
print(i)
MotifCount("CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT", "ATTCTGGA", 3)
It gives:
6
7
26
27
Quickly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With