Would using hash in python speed up string/char comparison?

Question

I know that converting strings in python can help speed up string comparison. But I'm not sure if my code can be accelerated by converting my strings to hashes.

I am starting with line 1 and looking for all lines that are within a levenshtein distance of 1 and adding them all to a list. So in the below example the first three lines all get added to a list and the fourth line does not. Then I am iterating through that list and comparing the characters found in each string at pos 1,2,3... Finally, I am outputting a string that is sort of like an average such that the first character is an h because it is the most common letter found at char position 1, and the second character is an e because that is the most common second character, and so on.

The problem is that this is with very large amounts of sequencing data, and has already been running for a week on a cluster and has not made very rapid progress.

So, if I convert all strings to hashes first can that be used to find levenshtein distance and then character differences or not. Because from what I can tell this isn't possible with hashes. Thanks.

Example Input file:

hello
hallo
helpo
hrowi

Example Output file:

hello

John Coleman · Accepted Answer

On Edit: I am adding a second version which is less dependent on the bioinformatics perspective but am keeping the first version since for strict bioinformatics it might be better.

Here is a possible answer for the alphabet "ATGC". I based it on the notion of DNA fingerprinting. I have a possibly large list of large strands. I randomly select a subset of those positions to be "fingerprinted". In my test data the strands were of length 10,000 and I randomly picked 100 positions. I form a fingerprint of the characters in the first line at those positions. Then, based on that fingerprint I create a set of "smudged fingerprints" -- which consists of the original fingerprint and those fingerprints at Hamming distance 1. I then iterate through the rest of the list. I first fingerprint the strand and see if it is in the set of smudges. If it is -- then I look if the strand is within Hamming distance 1. If it is -- I update a dictionary I am maintaining at each position, keyed by the characters. Finally I create the summary string. With 10,000 strands each of which has length 10,000 it only takes a few seconds to both create the these strings and compute the summary string:

import random

def randDNA(n):
    return ''.join(random.choice("ATGC") for i in range(n))

def mutate(strand,times):
    nucleotides = list(strand)
    n = len(strand)
    for i in range(times):
        j = random.randint(0,n-1)
        nucleotides[i] = random.choice("ATGC")
    return ''.join(nucleotides)

def HammingClose(s,t):
    #assumes s and t are strings of the same length
    #returns True if s,t are Hamming distance at most 1
    #otherwise returns False

    clashes = 0
    for x,y in zip(s,t):
        if x != y:
            clashes += 1
            if clashes > 1: return False
    return True

def takeFingerPrint(strand,places):
    return ''.join(strand[i] for i in places)

def smudges(fp):
    s = set([fp])
    for i,c in enumerate(fp):
        s.update([fp[:i] + d + fp[(i+1):] for d in "ATGC" if d != c])
    return s

def summary(strandList, fpSize = 10):
    n = len(strandList)
    refStrand = strandList[0]
    ncount = 1 #count of refstrand + Hamming-neighbors
    position = {}

    for i,c in enumerate(refStrand):
        position[i] = dict.fromkeys("ATGC",0)
        position[i][c] += 1

    #take a random fingerprint of size fpSize
    places = random.sample(range(n),fpSize)
    refPrint = takeFingerPrint(refStrand,places)
    s = smudges(refPrint)

    for strand in strandList[1:]:
        fp = takeFingerPrint(strand,places)
        if fp in s: #maybe a hit!
            if HammingClose(strand,refStrand):
                ncount += 1
                for i,c in enumerate(strand):
                    position[i][c] += 1


    #assemble summary strand

    mode = []
    for i in range(len(refStrand)):
        c = "A"
        m = position[i]["A"]
        for x in "TGC":
            if position[i][x] > m:
                c = x
                m = position[i][x]
        mode.append(c)

    return(ncount,''.join(mode))

#example problem

strand = randDNA(10000)
strandList = [mutate(strand,5) for i in range(10000)]

n,s = summary(strandList,100)
print(n, "close strands found")
print("First 30 positions in summary strand are ", s[:30])

Sample run:

158 close strands found
First 30 positions in summary strand are  CAAGGTCGTCGCCCATAAACGTTTTTCCCA

My code didn't address the problem of ties, but should be easily modifiable to do so.

Second version. You can replace the ALPHA in the code by whatever character set you are using. Fingerprints are now initial slices. I form the set of all strings at Hamming distance exactly one from the initial slice of the first line. Then on iteration I check if either the initial slice equals of the line equals the reference slice and the rest of the line is at Hamming distance at most 1 or if the initial slice is in the set of slices at Hamming distance 1, in which case the rest of the line must equal the rest of the first line. I am assuming that the Python interpreter can test strings for equality faster than it can execute a loop. The resulting code seems to run over twice as fast as my initial code:

import random

ALPHA = "ATGC"

def randDNA(n):
    return ''.join(random.choice(ALPHA) for i in range(n))

def mutate(strand,times):
    nucleotides = list(strand)
    n = len(strand)
    for i in range(times):
        j = random.randint(0,n-1)
        nucleotides[i] = random.choice(ALPHA)
    return ''.join(nucleotides)

def HammingOne(s,t):
    #assumes s and t are strings of the same length
    #returns True if s,t are Hamming distance 1
    #otherwise returns False

    clashes = 0
    for x,y in zip(s,t):
        if x != y:
            clashes += 1
            if clashes > 1: return False
    return True if clashes == 1 else False


def neighbors(s):
    n = set()
    for i,c in enumerate(s):
        n.update([s[:i] + d + s[(i+1):] for d in ALPHA if d != c])
    return n

def summary(sList, fpSize = 10):
    n = len(sList)
    refString = sList[0]
    ncount = 0 #count of Hamming-neighbors
    position = {}

    for i,c in enumerate(refString):
        position[i] = dict.fromkeys(ALPHA,0)
        position[i][c] += 1


    refPrint = refString[:fpSize]
    s = neighbors(refPrint)
    refTail = refString[fpSize:]

    for strand in sList[1:]:
        fp = strand[:fpSize]
        if (fp == refPrint) and \
           (strand[fpSize:] == refTail or HammingOne(strand[fpSize:],refTail)) or \
           (HammingOne(fp,refPrint) and strand[fpSize:] == refTail):
            ncount += 1
            for i,c in enumerate(strand):
                position[i][c] += 1

    #assemble summary strand

    mode = []
    for i in range(len(refString)):
        c = ALPHA[0]
        m = position[i][c]
        for x in ALPHA[1:]:
            if position[i][x] > m:
                c = x
                m = position[i][x]
        mode.append(c)

    return(ncount,''.join(mode))

#example problem

strand = randDNA(10000)
sList = [mutate(strand,5) for i in range(10000)]

n,s = summary(sList,100)
print(n, "close strands found")
print("First 30 positions in summary strand are ", s[:30])

Would using hash in python speed up string/char comparison?

Tags:

python

hash

The Nightman

1 Answers

John Coleman

Recent Activity

Donate For Us

Would using hash in python speed up string/char comparison?

Tags:

python

hash

The Nightman

1 Answers

John Coleman

Related questions

Recent Activity

Donate For Us