The core problem is:
I am looking for an algorithm to calculate a maximum parsimonious distance between a set of strings. With distance I mean something similar to the Damerau–Levenshtein distance i.e. the minimal number of deletions, insertions, substitution and transposition of characters or adjacent characters blocks. But instead of regular strings I want to investigate strings with oriented characters.
Thus, a string could look like:
(A,1) (B,1) (C,1) (D,1)
and possible derivatives could be:
(A,1) (C,0) (B,0) (D,1)
(A,1) (C,1) (B,1) (D,1)
(A,1) (B,0) (C,0) (D,1)
Where A,B,C,D
are character identities and 1 = forward
and 0 = reverse
.
Here, derivative 1. would have the distance 2, since you could cut out block BC and re paste it inverted (1 cut, 1 paste). Derivative 2. would have 2 as well, since you may cut out C and re paste it in front of B (1 cut, 1 Paste) whereas number 3. would require 4 operations (2 cuts, 2 pastes) to be transformed. Analogous, deletion or insertion of blocks would yield a distance 1.
If you would define (X,0)
and (X,1)
as two different non oriented characters (X0, X1)
for all possible X, example 3. would result in a distance of 2 since you could then cut out the block B1C1
and insert the block B0C0
in two steps.
A real world example:
The genes in a bacterial genome could be considered the oriented character (A,0), (B,0)... By determining the sequence distance, the genomic orientation of homologue genes in two related bacteria could be used as a evolutionary marker trace. The fact that bacterial genomes are circular strings introduces the additional border condition ABC is equal to BCA.
Real genomes do have unique genes with no equivalent in a partner thus giving rise to a place holder character @. Those place holders reduce the information content of the comparison to a lower bound, since e.g. (A,1)(B,1)@(C,1) can be transformed to (A,1)@@@(B,1)@(C,1) by insertion of the block @@@. However the orientation restores partially the information content since you may find (A,1)@@@(B,0)@(C,1) indicating a minimum distance of 3. Even better would be an algorithm to compare multiple related sequences (genomes) simultaneously, since you could then find intermediates in the evolutionary history, which increases resolution.
I realize, there are several questions already posted on text string comparison. But they fail to be easily expandable to include the orientation. In addition, there exists a wealth of methods to deal with biological sequences, in particular for multiple sequence analysis. However, those are limited to macromolecule sequences which do not exist in alternate orientations and usually invoke specific weights for any particular character match.
If there exists already a python library which would allow for the necessary customization to solve that problem, that would be fantastic. But any suitable orientation aware algorithm would be very helpful.
I believe the following code can help you out:
from itertools import permutations
from random import randint
from pprint import pprint
def generate_genes():
"""
Generates a boilerplate list of genes
@rtype : list
"""
tuple_list = []
for i in range(16):
binary_var = bin(i)[2:]
if len(binary_var) != 4:
binary_var = "0" * (4 - len(binary_var)) + binary_var
tuple_list.append([('A', (1 if binary_var[0] == '1' else 0)),
('B', (1 if binary_var[1] == '1' else 0)),
('C', (1 if binary_var[2] == '1' else 0)),
('D', (1 if binary_var[3] == '1' else 0))])
return tuple_list
def all_possible_genes():
""" Generates all possible combinations of ABCD genes
@return: returns a list of combinations
@rtype: tuple
"""
gene_list = generate_genes()
all_possible_permutations = []
for gene in gene_list:
all_possible_permutations.append([var for var in permutations(gene)])
return all_possible_permutations
def gene_stringify(gene_tuple):
"""
@type gene_tuple : tuple
@param gene_tuple: The gene tuple generated
"""
return "".join(str(var[0]) for var in gene_tuple if var[1])
def dameraulevenshtein(seq1, seq2):
"""Calculate the Damerau-Levenshtein distance between sequences.
This distance is the number of additions, deletions, substitutions,
and transpositions needed to transform the first sequence into the
second. Although generally used with strings, any sequences of
comparable objects will work.
Transpositions are exchanges of *consecutive* characters; all other
operations are self-explanatory.
This implementation is O(N*M) time and O(M) space, for N and M the
lengths of the two sequences.
>>> dameraulevenshtein('ba', 'abc')
2
>>> dameraulevenshtein('fee', 'deed')
2
It works with arbitrary sequences too:
>>> dameraulevenshtein('abcd', ['b', 'a', 'c', 'd', 'e'])
2
"""
# codesnippet:D0DE4716-B6E6-4161-9219-2903BF8F547F
# Conceptually, this is based on a len(seq1) + 1 * len(seq2) + 1 matrix.
# However, only the current and two previous rows are needed at once,
# so we only store those.
oneago = None
thisrow = range(1, len(seq2) + 1) + [0]
for x in xrange(len(seq1)):
# Python lists wrap around for negative indices, so put the
# leftmost column at the *end* of the list. This matches with
# the zero-indexed strings and saves extra calculation.
twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
for y in xrange(len(seq2)):
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + (seq1[x] != seq2[y])
thisrow[y] = min(delcost, addcost, subcost)
# This block deals with transpositions
if (x > 0 and y > 0 and seq1[x] == seq2[y - 1]
and seq1[x - 1] == seq2[y] and seq1[x] != seq2[y]):
thisrow[y] = min(thisrow[y], twoago[y - 2] + 1)
return thisrow[len(seq2) - 1]
if __name__ == '__main__':
genes = all_possible_genes()
list1 = genes[randint(0, 15)][randint(0, 23)]
list2 = genes[randint(0, 15)][randint(0, 23)]
print gene_stringify(list1)
pprint(list1)
print gene_stringify(list2)
pprint(list2)
print dameraulevenshtein(gene_stringify(list1), gene_stringify(list2))
Michael Homer for the algorithm
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With