I have a list of sequences `l` (many thousands of sequences): `l = ["ABCD", "AABA", ...]`. I also have a file `f` with many four-letter sequences (around a million of them). For every sequence in `f`, I want to find the closest string in `l`; whenever that closest match is within a Hamming distance of at most 2, I increment the counter `good_count`. I wrote the following code for this, but it is very slow. I was wondering if it can be done faster.
def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

f = open("input.txt", "r")
l = ["ABCD", "AABA", ...]
good_count = 0
for s in f:
    x = f.readline()
    dist_array = []
    for ll in l:
        dist = hamming(x, ll)
        dist_array.append(dist)
    min_dist = min(dist_array)
    if min_dist <= 2:
        good_count += 1
print(good_count)
It works fast if `l` and `f` are small, but takes too long when both are large. Please suggest a quicker solution.
Use existing libraries, for instance jellyfish:
from jellyfish import hamming_distance
This gives you a C implementation of the Hamming distance.
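For example, a minimal sketch of swapping it in for the pure-Python version (assuming jellyfish is installed, e.g. via pip install jellyfish):

from jellyfish import hamming_distance

# C-backed Hamming distance; same result as the pure-Python hamming()
# above for equal-length strings.
print(hamming_distance("ABCD", "AABA"))  # 3
print(hamming_distance("ABCD", "ABCA"))  # 1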
Oh, you're just counting how MANY sequences have a match within a Hamming distance of 2? That can be done much more quickly:
total_count = 0
for line in f:
    # skip the s = f.readline() since that's what `line` is in this loop
    line = line.strip()  # just in case
    for ll in l:
        if hamming(line, ll) <= 2:
            total_count += 1
            break  # skip the rest of the `ll in l` loop
# and then you don't need any processing afterwards either
Note that most of your code's time will be spent on the line:

if hamming(line, ll) <= 2:

So any way you can improve that algorithm will GREATLY improve your overall script speed. Boud's answer extols the virtues of jellyfish's hamming_distance function; without any personal experience I can't recommend it myself, but his advice to use a faster implementation of Hamming distance is sound!
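One cheap, dependency-free option (my own sketch, not from the answers above): since you only care whether the distance is at most 2, you can bail out of the comparison as soon as a third mismatch is seen. The name hamming_at_most is hypothetical:

def hamming_at_most(s1, s2, limit=2):
    # Early-exit variant: stop counting once `limit` is exceeded,
    # rather than comparing every position unconditionally.
    dist = 0
    for ch1, ch2 in zip(s1, s2):
        if ch1 != ch2:
            dist += 1
            if dist > limit:
                return False
    return True

You would then write `if hamming_at_most(line, ll):` in the inner loop. For four-character strings the savings are modest, but it avoids counting mismatches you never need.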
Peter DeGlopper suggests splitting the `l` list into six different sets of "Hamming distance two or less" matches, that is, a group of sets that together cover every pair of positions a match could share. The idea: if two four-letter strings are within Hamming distance 2, they agree in at least two of the four positions, so at least one of the C(4, 2) = 6 position pairs matches exactly. This might look like:
# hamming_sets is [ {AB??}, {A?C?}, {A??D}, {?BC?}, {?B?D}, {??CD} ]
hamming_sets = [set(), set(), set(), set(), set(), set()]
for ll in l:
    # this should take the lion's share of time in your program
    hamming_sets[0].add(ll[0] + ll[1])
    hamming_sets[1].add(ll[0] + ll[2])
    hamming_sets[2].add(ll[0] + ll[3])
    hamming_sets[3].add(ll[1] + ll[2])
    hamming_sets[4].add(ll[1] + ll[3])
    hamming_sets[5].add(ll[2] + ll[3])
total_count = 0
for line in f:
    # and this should be fast, even if `f` is large
    line = line.strip()
    if line[0]+line[1] in hamming_sets[0] or \
       line[0]+line[2] in hamming_sets[1] or \
       line[0]+line[3] in hamming_sets[2] or \
       line[1]+line[2] in hamming_sets[3] or \
       line[1]+line[3] in hamming_sets[4] or \
       line[2]+line[3] in hamming_sets[5]:
        total_count += 1
You could possibly gain readability by making `hamming_sets` a dictionary of `transform_function: set_of_results` key-value pairs:
hamming_sets = {(lambda s: s[0] + s[1]): set(),
                (lambda s: s[0] + s[2]): set(),
                (lambda s: s[0] + s[3]): set(),
                (lambda s: s[1] + s[2]): set(),
                (lambda s: s[1] + s[3]): set(),
                (lambda s: s[2] + s[3]): set()}

for func, set_ in hamming_sets.items():
    for ll in l:
        set_.add(func(ll))

total_count = 0
for line in f:
    line = line.strip()
    if any(func(line) in set_ for func, set_ in hamming_sets.items()):
        total_count += 1
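Alternatively (my own sketch, not from the original answer), you could generate the six position pairs with itertools.combinations and key the dictionary on the index pairs themselves, which avoids the lambdas entirely:

from itertools import combinations

# Build one set per pair of positions; the key is the (i, j) index pair.
pairs = list(combinations(range(4), 2))  # [(0, 1), (0, 2), ..., (2, 3)]
hamming_sets = {(i, j): {ll[i] + ll[j] for ll in l} for i, j in pairs}

total_count = 0
for line in f:
    line = line.strip()
    if any(line[i] + line[j] in hamming_sets[(i, j)] for i, j in pairs):
        total_count += 1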