
Efficiency of finding mismatched patterns

I'm working on a simple bioinformatics problem. I have a working solution, but it is absurdly inefficient. How can I increase my efficiency?


Problem:

Find patterns of length k in the string g, given that the k-mer can have up to d mismatches.

And these strings and patterns are all genomic--so our set of possible characters is {A, T, C, G}.

I'll call the function FrequentWordsMismatch(g, k, d).

So, here are a few helpful examples:

FrequentWordsMismatch('AAAAAAAAAA', 2, 1) → ['AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT']

Here's a much longer example, if you implement this and want to test:

FrequentWordsMismatch('CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGCCGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGGCCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCGGTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACACACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC', 10, 2) → ['GCACACAGAC', 'GCGCACACAC']

With my naive solution, that second example could easily take ~60 seconds, though the first one is pretty quick.


Naive solution:

My idea was to, for every k-length segment in g, find every possible "neighbor" (i.e. other k-length segments with up to d mismatches) and add those neighbors as keys to a dictionary. I then count how many times each one of those neighbor k-mers shows up in the string g, and record those counts in the dictionary.

Obviously that's a pretty bad way to do it, since the number of neighbors blows up as k and d increase, and having to scan the whole string with each of those neighbors makes this implementation terribly slow. But alas, that's why I'm asking for help.

I'll put my code below. There are definitely a lot of novice mistakes to unpack, so thanks for your time and attention.

def FrequentWordsMismatch(g, k, d):
    '''
    Finds the most frequent k-mer patterns in the string g, given that those 
    patterns can mismatch amongst themselves up to d times

    g (String): Collection of {A, T, C, G} characters
    k (int): Length of desired pattern
    d (int): Number of allowed mismatches
    '''
    counts = {}
    answer = []

    for i in range(len(g) - k + 1):
        kmer = g[i:i+k]
        for neighborkmer in Neighbors(kmer, d):
            counts[neighborkmer] = Count(neighborkmer, g, d)

    maxVal = max(counts.values())

    for key in counts.keys():
        if counts[key] == maxVal:
            answer.append(key)

    return(answer)


def Neighbors(pattern, d):
    '''
    Find all strings with at most d mismatches to the given pattern

    pattern (String): Original pattern of characters
    d (int): Number of allowed mismatches
    '''
    if d == 0:
        return [pattern]

    if len(pattern) == 1:
        return ['A', 'C', 'G', 'T']

    answer = []

    suffixNeighbors = Neighbors(pattern[1:], d)

    for text in suffixNeighbors:
        if HammingDistance(pattern[1:], text) < d:
            for n in ['A', 'C', 'G', 'T']:
                answer.append(n + text)
        else:
            answer.append(pattern[0] + text)

    return(answer)


def HammingDistance(p, q):
    '''
    Find the hamming distance between two strings

    p (String): String to be compared to q
    q (String): String to be compared to p
    '''
    ham = 0 + abs(len(p)-len(q))

    for i in range(min(len(p), len(q))):
        if p[i] != q[i]:
            ham += 1

    return(ham)


def Count(pattern, g, d):
    '''
    Count the number of times that the pattern occurs in the string g, 
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    return len(MatchWithMismatch(pattern, g, d))

def MatchWithMismatch(pattern, g, d):
    '''
    Find the indices at which the pattern occurs in the string g, 
    allowing for up to d mismatches

    pattern (String): Pattern of characters
    g (String): String in which we're looking for pattern
    d (int): Number of allowed mismatches
    '''
    answer = []
    for i in range(len(g) - len(pattern) + 1):
        if(HammingDistance(g[i:i+len(pattern)], pattern) <= d):
            answer.append(i)
    return(answer)

More tests

FrequentWordsMismatch('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1) → ['ATGC', 'ATGT', 'GATG']

FrequentWordsMismatch('AGTCAGTC', 4, 2) → ['TCTC', 'CGGC', 'AAGC', 'TGTG', 'GGCC', 'AGGT', 'ATCC', 'ACTG', 'ACAC', 'AGAG', 'ATTA', 'TGAC', 'AATT', 'CGTT', 'GTTC', 'GGTA', 'AGCA', 'CATC']

FrequentWordsMismatch('AATTAATTGGTAGGTAGGTA', 4, 0) → ["GGTA"]

FrequentWordsMismatch('ATA', 3, 1) → ['GTA', 'ACA', 'AAA', 'ATC', 'ATA', 'AGA', 'ATT', 'CTA', 'TTA', 'ATG']

FrequentWordsMismatch('AAT', 3, 0) → ['AAT']

FrequentWordsMismatch('TAGCG', 2, 1)  → ['GG', 'TG']
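A small, order-insensitive checker for a few of these cases could look like the sketch below (just a convenience, assuming the function under test is named FrequentWordsMismatch):

TEST_CASES = [
    ('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1, {'ATGC', 'ATGT', 'GATG'}),
    ('AATTAATTGGTAGGTAGGTA', 4, 0, {'GGTA'}),
    ('ATA', 3, 1, {'GTA', 'ACA', 'AAA', 'ATC', 'ATA', 'AGA', 'ATT', 'CTA', 'TTA', 'ATG'}),
    ('AAT', 3, 0, {'AAT'}),
    ('TAGCG', 2, 1, {'GG', 'TG'}),
]

# Compare as sets, since the expected outputs are unordered.
for g, k, d, expected in TEST_CASES:
    result = set(FrequentWordsMismatch(g, k, d))
    assert result == expected, (g, k, d, result, expected)
print("all test cases pass")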
asked Jun 16 '18 by julianstanley

2 Answers

The problem description is ambiguous in several ways, so I'm going by the examples. You seem to want all k-length strings from the alphabet {A, C, G, T} such that the number of matches to contiguous substrings of g is maximal - where "a match" means character-by-character equality with at most d character inequalities.

I'm ignoring that your HammingDistance() function makes something up even when inputs have different lengths, mostly because it doesn't make much sense to me ;-) , but partly because that isn't needed to get the results you want in any of the examples you gave.

The code below produces the results you want in all the examples, in the sense of producing permutations of the output lists you gave. If you want canonical outputs, I'd suggest sorting an output list before returning it.

The algorithm is pretty simple, but relies on itertools to do the heavy combinatorial lifting "at C speed". All the examples run in well under a second total.

For each length-k contiguous substring of g, consider all combinations(k, d) sets of d distinct index positions. There are 4**d ways to fill those index positions with letters from {A, C, G, T}, and each such way is "a pattern" that matches the substring with at most d discrepancies. Duplicates are weeded out by remembering the patterns already generated; this is faster than making heroic efforts to generate only unique patterns to begin with.

So, in all, the time requirement is O(len(g) * k**d * 4**d) = O(len(g) * (4*k)**d), where k**d is, for reasonably small values of k and d, an overstated stand-in for the binomial coefficient combinations(k, d). The important thing to note is that - unsurprisingly - it's exponential in d.
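For a concrete sense of scale, a quick back-of-the-envelope computation of the per-window candidate count for the long example (k=10, d=2):

from math import comb

k, d = 10, 2
candidates_per_window = comb(k, d) * 4**d   # C(10, 2) * 4**2 = 45 * 16 = 720
print(candidates_per_window)                # 720 candidates generated per window,
                                            # before duplicates are weeded out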

def fwm(g, k, d):
    from itertools import product, combinations
    from collections import defaultdict

    # All ways to fill d positions with letters, and all choices of d index positions.
    all_subs = list(product("ACGT", repeat=d))
    all_ixs = list(combinations(range(k), d))
    patcount = defaultdict(int)

    for starti in range(len(g)):
        base = g[starti : starti + k]
        if len(base) < k:  # ran off the end of g
            break
        patcount[base] += 1
        seen = set([base])  # patterns already counted for this window
        basea = list(base)
        for ixs in all_ixs:
            saved = [basea[i] for i in ixs]
            for newchars in all_subs:
                for i, newchar in zip(ixs, newchars):
                    basea[i] = newchar
                candidate = "".join(basea)
                if candidate not in seen:
                    seen.add(candidate)
                    patcount[candidate] += 1
            # Restore the window before trying the next set of index positions.
            for i, ch in zip(ixs, saved):
                basea[i] = ch

    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
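As a quick sanity check, this reproduces a few of the question's examples (comparing as sets, since the output order isn't specified):

assert set(fwm('AAAAAAAAAA', 2, 1)) == {'AA', 'CA', 'GA', 'TA', 'AC', 'AG', 'AT'}
assert set(fwm('ACGTTGCATGTCGCATGATGCATGAGAGCT', 4, 1)) == {'ATGC', 'ATGT', 'GATG'}
assert set(fwm('AATTAATTGGTAGGTAGGTA', 4, 0)) == {'GGTA'}
print("spot checks passed")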

EDIT: Generating Patterns Uniquely

Rather than weed out duplicates by keeping a set of those seen so far, it's straightforward enough to prevent generating duplicates to begin with. In fact, the following code is shorter and simpler, although somewhat subtler. In return for less redundant work, there are layers of recursive calls to the inner() function. Which way is faster appears to depend on the specific inputs.

def fwm(g, k, d):
    from collections import defaultdict

    patcount = defaultdict(int)
    alphabet = "ACGT"
    # For each letter, the 3 letters it could be changed to.
    allbut = {ch: tuple(c for c in alphabet if c != ch)
              for ch in alphabet}

    def inner(i, rd):
        # i is the current position in `base`; rd is the number of mismatches
        # still allowed.  When either runs out, `base` holds a finished pattern.
        if not rd or i == k:
            patcount["".join(base)] += 1
            return
        # Leave position i unchanged ...
        inner(i+1, rd)
        # ... or substitute each of the other 3 letters, spending one mismatch.
        orig = base[i]
        for base[i] in allbut[orig]:
            inner(i+1, rd-1)
        base[i] = orig

    for i in range(len(g) - k + 1):
        base = list(g[i : i + k])
        inner(0, d)

    maxcount = max(patcount.values())
    return [p for p, c in patcount.items() if c == maxcount]
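If you want to see which version wins on your own inputs, a minimal timing sketch (assuming the two versions above have been renamed fwm_seen and fwm_recursive so they can coexist) might look like:

import random
from timeit import timeit

random.seed(0)
# Synthetic input of roughly the same size as the question's long example.
g = ''.join(random.choice('ACGT') for _ in range(400))

for fn in (fwm_seen, fwm_recursive):  # hypothetical names for the two versions above
    t = timeit(lambda: fn(g, 10, 2), number=5)
    print(f"{fn.__name__}: {t / 5:.3f} s per call")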
answered Nov 06 '22 by Tim Peters


Going on your problem description alone and not your examples (for the reasons I explained in the comment), one approach would be:

s = "CACAGTAGGCGCCGGCACACACAGCCCCGGGCCCCGGGCCGCCCCGGGCCGGCGGCCGCCGGCGCCGGCACACCGGCACAGC"\
    "CGTACCGGCACAGTAGTACCGGCCGGCCGGCACACCGGCACACCGGGTACACACCGGGGCGCACACACAGGCGGGCGCCGGG"\
    "CCCCGGGCCGTACCGGGCCGCCGGCGGCCCACAGGCGCCGGCACAGTACCGGCACACACAGTAGCCCACACACAGGCGGGCG"\
    "GTAGCCGGCGCACACACACACAGTAGGCGCACAGCCGCCCACACACACCGGCCGGCCGGCACAGGCGGGCGGGCGCACACAC"\
    "ACCGGCACAGTAGTAGGCGGCCGGCGCACAGCC"

def frequent_words_mismatch(g,k,d):
    def num_misspellings(x,y):
        return sum(xx != yy for (xx,yy) in zip(x,y))

    seen = set()
    for i in range(len(g)-k+1):
        seen.add(g[i:i+k])

    # For each unique sequence, add a (key,bin) pair to the bins dictionary
    #  (The bin is initialized to a list containing only the sequence, for now)
    bins = {seq:[seq,] for seq in seen}
    # Loop again through the unique sequences...
    for seq in seen:
        # Try to fit it in *all* already-existing bins (based on bin key)
        for bk in bins:
            # Don't re-add seq to its own bin
            if bk == seq: continue
            # Test bin keys, try to find all appropriate bins
            if num_misspellings(seq, bk) <= d:
                bins[bk].append(seq)

    # Get a list of the bin keys (one for each unique sequence) sorted in order of the
    #   number of elements in the corresponding bins
    sorted_keys = sorted(bins, key= lambda k:len(bins[k]), reverse=True)

    # largest_bin_key will be the key of the largest bin (there may be ties, so in fact
    #   this is *a* key of *one of the bins with the largest length*).  That is, it'll
    #   be the sequence (found in the string) that the most other sequences (also found
    #   in the string) are at most d-distance from.
    largest_bin_key = sorted_keys[0]

    # You can return this bin, as your question description (but not examples) indicate:
    return bins[largest_bin_key]

largest_bin = frequent_words_mismatch(s,10,2)
print(len(largest_bin))     # 13
print(largest_bin)

The (this) largest bin contains:

['CGGCCGCCGG', 'GGGCCGGCGG', 'CGGCCGGCGC', 'AGGCGGCCGG', 'CAGGCGCCGG',
 'CGGCCGGCCG', 'CGGTAGCCGG', 'CGGCGGCCGC', 'CGGGCGCCGG', 'CCGGCGCCGG',
 'CGGGCCCCGG', 'CCGCCGGCGG', 'GGGCCGCCGG']

It's O(n**2), where n is the number of unique sequences, and it completes in around 0.1 seconds on my computer.
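As a minimal sanity check (assuming largest_bin from the snippet above is in scope), every member of the bin should be within d = 2 mismatches of the bin's key, which by construction is the bin's first element:

def num_misspellings(x, y):
    return sum(xx != yy for xx, yy in zip(x, y))

key = largest_bin[0]
assert all(num_misspellings(key, seq) <= 2 for seq in largest_bin)
print("all", len(largest_bin), "sequences are within 2 mismatches of", key)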

answered Nov 06 '22 by jedwards