Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Implementing the Waterman-Eggert algorithm

I am trying to implement the Waterman-Eggert algorithm for finding sub-optimal local sequence alignments, but am struggling to understand how to "declump" separate groups of alignments. I have the basic Smith-Waterman algorithm working fine.

A simple test aligning the following sequence against itself:

'HEAGHEAGHEAG'
'HEAGHEAGHEAG'

produces an fMatrix as follows:

 [[  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.]
  [  0.   8.   0.   0.   0.   8.   0.   0.   0.   8.   0.   0.   0.]
  [  0.   0.  13.   0.   0.   0.  13.   0.   0.   0.  13.   0.   0.]
  [  0.   0.   0.  17.   0.   0.   0.  17.   0.   0.   0.  17.   0.]
  [  0.   0.   0.   0.  23.   0.   0.   0.  23.   0.   0.   0.  23.]
  [  0.   8.   0.   0.   0.  31.   0.   0.   0.  31.   0.   0.   0.]
  [  0.   0.  13.   0.   0.   0.  36.   0.   0.   0.  36.   0.   0.]
  [  0.   0.   0.  17.   0.   0.   0.  40.   0.   0.   0.  40.   0.]
  [  0.   0.   0.   0.  23.   0.   0.   0.  46.   0.   0.   0.  46.]
  [  0.   8.   0.   0.   0.  31.   0.   0.   0.  54.   4.   0.   0.]
  [  0.   0.  13.   0.   0.   0.  36.   0.   0.   4.  59.   9.   0.]
  [  0.   0.   0.  17.   0.   0.   0.  40.   0.   0.   9.  63.  13.]
  [  0.   0.   0.   0.  23.   0.   0.   0.  46.   0.   0.  13.  69.]]

In order to find the sub-optimal alignments, e.g.

'HEAGHEAGHEAG    '
'    HEAGHEAGHEAG'

you have to first remove the optimal alignment (i.e. along the main diagonal) and recalculate the fMatrix; this is known as "declumping", where a "clump" of alignments is defined as any alignments whose paths intersect/share one or more pairs of aligned residues. In addition to the fMatrix, there is a secondary matrix that contains information about the direction in which fMatrix was constructed.

A snippet of the code to build the fMatrix and backtracking matrix is as follows:

# Generates fMatrix.
for i in range(1, length):
    for j in range(1, length):
        matchScore = fMatrix[i-1][j-1] + simMatrixDict[seq[i-1]+seq[j-1]]
        insScore = fMatrix[i][j-1] + gap
        delScore = fMatrix[i-1][j] + gap
        fMatrix[i][j] = max(0, matchScore, insScore, delScore)

        # Generates matrix for backtracking.
        if fMatrix[i][j] == matchScore:
            backMatrix[i][j] = 2      
        elif fMatrix[i][j] == insScore:
            backMatrix[i][j] = 3        # INSERTION in seq - Horizontal
        elif fMatrix[i][j] == delScore:
            backMatrix[i][j] = 1        # DELETION in seq - Vertical

        if fMatrix[i][j] >= backtrackStart: 
            backtrackStart = fMatrix[i][j]
            endCoords = i, j                
return fMatrix, backMatrix, endCoords

In order to remove this optimal alignment I have tried using this backMatrix to backtrack through the fMatrix (as per original Smith-Waterman algorithm) and set fMatrix[i][j] = 0 as I go, but this doesn't remove the entire clump, only the exact alignment in that clump.

For some background information, the Wikipedia page for the Smith-Waterman algorithm explains how fMatrix is constructed and there is an explanation on here about how backtracking works. The Waterman-Eggert algorithm is roughly explained here.

Thanks.

like image 246
jonwells Avatar asked May 17 '13 14:05

jonwells


1 Answers

OK. Here's some code to do what you want. I've used the pretty print library (pprint) so that the output looks nice. (It looks nicest if the numbers in the matrix are single digit, but the alignment gets a little messed up if there are multi-digit numbers.)

How does it work?

Because you only need to change numbers on the main diagonal and and those on the diagonals above and below, we just need one for loop. matrix[i][i] is always on the main diagonal, because it is i rows down, and i columns across. matrix[i][i-1] is always the lower adjacent diagonal, because it is i rows down, and i-1 columns across. matrix[i-1][i] is always the upper adjacent diagonal, because it is i-1 rows down, and i rows across.

#!/usr/bin/python
import pprint

matrix = [
    [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,],
    [  0,   8,   0,   0,   0,   8,   0,   0,   0,   8,   0,   0,   0,],
    [  0,   0,  13,   0,   0,   0,  13,   0,   0,   0,  13,   0,   0,],
    [  0,   0,   0,  17,   0,   0,   0,  17,   0,   0,   0,  17,   0,],
    [  0,   0,   0,   0,  23,   0,   0,   0,  23,   0,   0,   0,  23,],
    [  0,   8,   0,   0,   0,  31,   0,   0,   0,  31,   0,   0,   0,],
    [  0,   0,  13,   0,   0,   0,  36,   0,   0,   0,  36,   0,   0,],
    [  0,   0,   0,  17,   0,   0,   0,  40,   0,   0,   0,  40,   0,],
    [  0,   0,   0,   0,  23,   0,   0,   0,  46,   0,   0,   0,  46,],
    [  0,   8,   0,   0,   0,  31,   0,   0,   0,  54,   4,   0,   0,],
    [  0,   0,  13,   0,   0,   0,  36,   0,   0,   4,  59,   9,   0,],
    [  0,   0,   0,  17,   0,   0,   0,  40,   0,   0,   9,  63,  13,],
    [  0,   0,   0,   0,  23,   0,   0,   0,  46,   0,   0,  13,  69,]]

print "Original Matrix"
pprint.pprint(matrix)
print

for i in range(len(matrix)):
    matrix[i][i] = 0
    if (i > 0) and (i < (len(matrix))):
        matrix[i][i-1] = 0
        matrix[i-1][i] = 0

print "New Matrix"
pprint.pprint(matrix)

Output:

Original Matrix
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 8, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 17, 0, 0, 0, 17, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 23, 0, 0, 0, 23],
 [0, 8, 0, 0, 0, 31, 0, 0, 0, 31, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 36, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 40, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 46],
 [0, 8, 0, 0, 0, 31, 0, 0, 0, 54, 4, 0, 0],
 [0, 0, 13, 0, 0, 0, 36, 0, 0, 4, 59, 9, 0],
 [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 9, 63, 13],
 [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 13, 69]]

New Matrix
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 8, 0, 0, 0, 8, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 13, 0, 0, 0, 13, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 17, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 23, 0, 0, 0, 23],
 [0, 8, 0, 0, 0, 0, 0, 0, 0, 31, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 0, 0, 0, 0, 36, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 0, 0, 0, 0, 40, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 0, 0, 0, 0, 46],
 [0, 8, 0, 0, 0, 31, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 13, 0, 0, 0, 36, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 17, 0, 0, 0, 40, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 23, 0, 0, 0, 46, 0, 0, 0, 0]]
like image 116
daviewales Avatar answered Sep 21 '22 09:09

daviewales