Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to speed up itertools combinations?

As part of my own research on genotype networks, I have a block of code in which I am trying to construct a single-difference network from a list of strings. The procedure is as such:

  1. Iterate over all pairwise combinations of strings.
  2. If the strings differ by a single position, then draw an edge between them in the network.
  3. If they don't differ by a single position, then pass.

The block of code I have right now is as such:

from itertools import combinations
import Levenshtein as lev # a package that wraps a C-implemented levenshtein distance
import networkx as nx
strings = [...list of strings...]

G = nx.Graph()

for (s1, s2) in combinations(strings, 2):
    if s1 not in G.nodes():
        G.add_node(s1)
    if s2 not in G.nodes():
        G.add_node(s2)

    if lev.distance(s1, s2) == 1:
        G.add_edge(s1, s2)

There's clearly no way to improve the computational complexity of the graph construction procedure - it will always be O(n**2). At least, in my limited knowledge, that's what I think - perhaps I'm wrong?

That said, given the normal scale of the number of comparisons that I need to make (~approx. 2000-5000), if I can get a few orders of magnitude speedup, overall, then the real-world computing time would be still acceptable - with the current Python implementation, it takes ~days to construct the graph. With the correct imports (not stated below), I have tried a Cython implementation below, but could not figure out how to make it faster:

cpdef cython_genotype_network(sequences):

    G = nx.Graph()
    cdef:
        unicode s1
        unicode s2

    for (s1, s2) in combinations(sequences, 2):
        if lev.distance(s1, s2) == 1:
            G.add_edge(s1, s2)

    return G

Particularly, Cython expects bytes, not str for s1 and s2. That block of code throws an error.

So... I come to my two questions:

  • Q1: Will the Cython implementation help? And how do I fix the bytes vs. str error?
  • Q2: Is it possible to do this problem using numpy instead? It's easy to convert from a numpy matrix to NetworkX; however, I can't seem to figure out how to broadcast the Levenshtein distance function across an n-by-n empty matrix where the each row and column corresponds to a string.

UPDATE 1: How to generate sample data

To generate strings:

from random import choice

def create_random_nucleotides_python(num_nucleotides):
    """
    Creates random nucleotides of length num_nucleotides.
    """

    sequence = ''

    letters = ['A', 'T', 'G', 'C']

    for i in range(num_nucleotides):
        sequence = sequence + choice(letters)

    return sequence


def mutate_random_position(string):
    """
    Mutates one position in the nucleotide sequence at random.
    """

    positions = [i for i in range(len(string))]
    pos_to_mut = choice(positions)

    letters = ['A', 'T', 'G', 'C']

    new_string = ''
    for i, letter in enumerate(string):
        if i == pos_to_mut:
            new_string = new_string + choice(letters)
        else:
            new_string = new_string + letter

    return new_string


# Create 100 Python srings by mutating a first sequence, then randomly choosing stuff to mutate a single position.
base_sequence = create_random_nucleotides_python(1000)

sequences = [base_sequence]

for i in range(99):
    sequence = choice(sequences)
    mutseq = mutate_random_position(sequence)
    sequences.append(mutseq)
like image 546
ericmjl Avatar asked Oct 31 '22 23:10

ericmjl


1 Answers

Regarding complexity:

You are considering each pair of strings. You don't need that. You can consider all 1-distance strings for each strings:

# I would start by populating the whole graph:
for s1 in strings:
    if s1 not in G.nodes():
        G.add_node(s1)
# Then compute the leven-1:
for s1 in strings:
    for s2 in variations(s1):
        if s2 in G.nodes():
            G.add_edge(s1, s2)

Now all your need is a variations(string) shorter than O(n):

This returns all variations with a distance of 1. (only 1 edit|delete|insert)

def variations(string):
    for i in range(len(string)):
        # delete
        yield string[:i] + string[i+1:]
        # edit
        for l in 'ATGC':
            yield string[:i] + l + string[i+1:]
        # insert
        for l in 'ATGC':
            yield string[:i] + l + string[i:]

    # insert at the end
    for l in 'ATGC':
        yield string + l

Now, the complexity of that is O(m^2) (because of the string concat), where m is the size of the longest sequence. If it is known, it is a constant, and all that is now O(1)

If the sequences are all the same size, you can compute only the edits.

Else, you can sort your sequences from largest to smallest, and only compute edits and deletes.

Or, you can sort your strings by size, and instead of comparing all strings with all other, compare those that have a size difference of <= 1.

like image 73
njzk2 Avatar answered Nov 14 '22 15:11

njzk2