As part of my own research on genotype networks, I have a block of code in which I am trying to construct a single-difference network from a list of strings: every string is a node, and two nodes are connected by an edge if the Levenshtein distance between them is exactly 1.
The block of code I have right now is as follows:
from itertools import combinations
import Levenshtein as lev  # a package that wraps a C-implemented Levenshtein distance
import networkx as nx

strings = [...list of strings...]

G = nx.Graph()
for (s1, s2) in combinations(strings, 2):
    if s1 not in G.nodes():
        G.add_node(s1)
    if s2 not in G.nodes():
        G.add_node(s2)
    if lev.distance(s1, s2) == 1:
        G.add_edge(s1, s2)
There's clearly no way to improve the computational complexity of the graph construction procedure - it will always be O(n**2). At least, in my limited knowledge, that's what I think - perhaps I'm wrong?
That said, given the typical number of sequences I need to compare (approx. 2000-5000), if I can get a few orders of magnitude of speedup overall, then the real-world computing time would still be acceptable - with the current Python implementation, it takes on the order of days to construct the graph. With the correct imports (not shown here), I have tried the Cython implementation below, but could not figure out how to make it faster:
cpdef cython_genotype_network(sequences):
    G = nx.Graph()
    cdef:
        unicode s1
        unicode s2

    for (s1, s2) in combinations(sequences, 2):
        if lev.distance(s1, s2) == 1:
            G.add_edge(s1, s2)
    return G
In particular, Cython expects bytes, not str, for s1 and s2, so that block of code throws an error.
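For illustration, the kind of workaround I mean is sketched below (untested; it assumes the Levenshtein package accepts byte strings and that the cdef declarations inside the function are changed accordingly):

# Untested sketch: encode the ASCII sequences to bytes once, before calling the
# Cython function, and change its declarations to "cdef bytes s1, s2".
byte_sequences = [s.encode('ascii') for s in sequences]
G = cython_genotype_network(byte_sequences)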
So... I come to my two questions:

1. How can I fix the Cython implementation above so that it compiles and actually gives a speedup?
2. Should I use numpy instead? It's easy to convert from a numpy matrix to NetworkX; however, I can't seem to figure out how to broadcast the Levenshtein distance function across an n-by-n matrix where each row and column corresponds to a string.
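For concreteness, here is a sketch of the kind of numpy construction I am imagining. It only covers the special case where all sequences have the same length, so that a Levenshtein distance of 1 reduces to a Hamming distance of 1 - true for the mutation-generated sample data below, but not in general:

import numpy as np
import networkx as nx

def hamming_adjacency(sequences):
    """Sketch: adjacency matrix where entry (i, j) is 1 when two equal-length
    sequences differ at exactly one position (Hamming distance 1)."""
    # Encode each ASCII sequence as a row of uint8 character codes.
    arr = np.frombuffer(''.join(sequences).encode('ascii'), dtype=np.uint8)
    arr = arr.reshape(len(sequences), -1)
    # Broadcast the pairwise comparison: mismatches[i, j] counts differing positions.
    # Note: the intermediate (n, n, m) boolean array gets large for thousands of
    # long sequences, so this would need chunking at my real scale.
    mismatches = (arr[:, None, :] != arr[None, :, :]).sum(axis=2)
    return (mismatches == 1).astype(int)

adj = hamming_adjacency(sequences)
G = nx.from_numpy_matrix(adj)  # from_numpy_array in newer NetworkX
G = nx.relabel_nodes(G, dict(enumerate(sequences)))  # index -> sequence string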
UPDATE 1: How to generate sample data
To generate strings:
from random import choice

def create_random_nucleotides_python(num_nucleotides):
    """
    Creates random nucleotides of length num_nucleotides.
    """
    sequence = ''
    letters = ['A', 'T', 'G', 'C']
    for i in range(num_nucleotides):
        sequence = sequence + choice(letters)
    return sequence

def mutate_random_position(string):
    """
    Mutates one position in the nucleotide sequence at random.
    """
    positions = [i for i in range(len(string))]
    pos_to_mut = choice(positions)
    letters = ['A', 'T', 'G', 'C']
    new_string = ''
    for i, letter in enumerate(string):
        if i == pos_to_mut:
            new_string = new_string + choice(letters)
        else:
            new_string = new_string + letter
    return new_string

# Create 100 Python strings by mutating a first sequence, then randomly choosing a sequence to mutate at a single position.
base_sequence = create_random_nucleotides_python(1000)
sequences = [base_sequence]
for i in range(99):
    sequence = choice(sequences)
    mutseq = mutate_random_position(sequence)
    sequences.append(mutseq)
Regarding complexity:
You are considering each pair of strings. You don't need that. You can instead consider, for each string, all the strings at distance 1 from it:
# I would start by populating the whole graph:
for s1 in strings:
    if s1 not in G.nodes():
        G.add_node(s1)

# Then compute the leven-1:
for s1 in strings:
    for s2 in variations(s1):
        if s2 in G.nodes():
            G.add_edge(s1, s2)
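One practical caveat (my addition, not part of the original answer): depending on the NetworkX version, s2 in G.nodes() may build and scan a list of nodes on every call, which would quietly reintroduce an O(n) factor. Testing against a plain set of the input strings (or against G itself, i.e. s2 in G) keeps each lookup at O(1):

# Variant of the loop above using a set for O(1) membership tests.
# node_set is introduced here only for illustration.
node_set = set(strings)

for s1 in strings:
    for s2 in variations(s1):
        if s2 in node_set:
            G.add_edge(s1, s2)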
Now all you need is a variations(string) generator that is cheaper than O(n). It returns all variations at a distance of 1 (a single edit, delete, or insert):
def variations(string):
    for i in range(len(string)):
        # delete
        yield string[:i] + string[i+1:]
        # edit (skip the unchanged letter, which would yield the string itself)
        for l in 'ATGC':
            if l != string[i]:
                yield string[:i] + l + string[i+1:]
        # insert
        for l in 'ATGC':
            yield string[:i] + l + string[i:]
    # insert at the end
    for l in 'ATGC':
        yield string + l
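As a quick sanity check (my addition, not part of the original answer), every string yielded really is at Levenshtein distance 1 from the input:

import Levenshtein as lev

test_sequence = 'ATGCATGC'
assert all(lev.distance(test_sequence, v) == 1 for v in variations(test_sequence))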
Now, the complexity of that is O(m^2) (because of the string concatenation), where m is the size of the longest sequence. If m is known and bounded, it is a constant, and generating the variations for one string is O(1) - so the whole graph construction becomes O(n) lookups instead of O(n^2) pairwise comparisons.
If the sequences are all the same size, you can compute only the edits (see the sketch below).
Else, you can sort your sequences from largest to smallest and only compute the edits and deletes.
Or, you can sort your strings by size and, instead of comparing all strings with all the others, only compare those whose sizes differ by at most 1.
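For the sample data above, every sequence has the same length (they are generated by point mutation only), so the edits-only case applies. A minimal sketch of that special case (the helper names are mine, not the answerer's):

import networkx as nx

def edit_variations(string, alphabet='ATGC'):
    """Yield every string at Hamming distance exactly 1 (substitutions only)."""
    for i, original_letter in enumerate(string):
        for l in alphabet:
            if l != original_letter:
                yield string[:i] + l + string[i+1:]

def genotype_network_same_length(sequences):
    G = nx.Graph()
    G.add_nodes_from(sequences)
    present = set(sequences)
    for s1 in sequences:
        for s2 in edit_variations(s1):
            if s2 in present:
                G.add_edge(s1, s2)
    return G

G = genotype_network_same_length(sequences)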