Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast(er) algorithm for the Length of the Longest Common Subsequence (LCS)

Problem: Need the Length of the LCS between two strings. The size of the strings is at most 100 characters. The alphabet is the usual DNA one, 4 characters "ACGT". The dynamic approach is not quick enough.

My problem is that I am dealing with lot's and lot's of pairs (of the rank of hundreds of million as far as I can see). I believe I have decreased the calling of the LCS_length function to the minimum possible so the only other way to make my program run faster is to have a more efficient LCS_Length function.

I have started off by implementing in the usual dynamic programming approach. That gives the correct answer and is hopefully implemented in properly.

#define arrayLengthMacro(a) strlen(a) + 1
#define MAX_STRING 101

static int MaxLength(int lengthA, int lengthB);

/* 
 * Then the two strings are compared following a dynamic computing
 * LCS table algorithm. Since we only require the length of the LCS 
 * we can get this rather easily.
 */
int LCS_Length(char *a, char *b)
{
    int lengthA = arrayLengthMacro(a),lengthB = arrayLengthMacro(b), 
        LCS = 0, i, j, maxLength, board[MAX_STRING][MAX_STRING];

        maxLength = MaxLength(lengthA, lengthB);

    //printf("%d %d\n", lengthA, lengthB);
    for (i = 0; i < maxLength - 1; i++)
    {
        board[i][0] = 0;
        board[0][i] = 0;
    }

    for (i = 1; i < lengthA; i++)
    {
        for (j = 1; j < lengthB; j++)
        {
/* If a match is found we allocate the number in (i-1, j-1) incremented  
 * by 1 to the (i, j) position
 */
            if (a[i - 1] == b[j - 1])
            {

                board[i][j] = board[i-1][j-1] + 1;
                if(LCS < board[i][j])
                {
                    LCS++;
                }
            }
            else
            {
                if (board[i-1][j] > board[i][j-1])
                {
                    board[i][j] = board[i-1][j];
                }
                else
                {
                    board[i][j] = board[i][j-1];
                }
            }
        }
    }

    return LCS;
}

That should be O(mn) (hopefully).

Then in search for speed I found this post Longest Common Subsequence Which gave the O(ND) paper by Myers. I tried this which relates the LCS with the shortest edit script (SES). The relation they give is D = M + N - 2L, where D is the length of the SES, M and N are the lengths of the two strings and L is the LCS length. But this isn't always correct in my implementation. I give my implementation (which I think is correct):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define arrayLengthMacro(a) strlen(a)

int LCS_Length (char *a, char *b);
int MinLength (int A, int B);
int Max (int A, int B);
int snake(int k, int max, char *a, char *b, int lengthA, int lengthB);

int main(void)
{
    int L;
    char a[] = "tomato", b[] = "potato"; //must give LCS = 4
    L =  LCS_Length(a, b);
    printf("LCS: %d\n", L );    
    char c[] = "GTCGTTCGGAATGCCGTTGCTCTGTAAA", d[] = "ACCGGTCGAGTGCGCGGAAGCCGGCCGAA"; // must give LCS = 20
    L =  LCS_Length(c, d);
    printf("LCS: %d\n", L );
    char e[] = "human", f[] = "chimpanzee"; // must give LCS = 4
    L =  LCS_Length(e, f);
    printf("LCS: %d\n", L );
    char g[] = "howareyou", h[] = "whoareyou"; // LCS =8
    L =  LCS_Length(g, h);
    printf("LCS: %d\n", L );
    char i[] = "TTCTTTCGGTAACGCCTACTTTATGAAGAGGTTACATTGCAATCGGGTAAATTAACCAACAAGTAATGGTAGTTCCTAGTAGAGAAACCCTCCCGCTCAC", 
        j[] = "GCACGCGCCTGTTGCTACGCTCTGTCCATCCTTTGTGTGCCGGGTACTCAGACCGGTAACTCGAGTTGCTATCGCGGTTATCAGGATCATTTACGGACTC"; // 61
    L =  LCS_Length(i, j);
    printf("LCS: %d\n", L );


    return 0;
}

int LCS_Length(char *a, char *b)
{

    int D, lengthA = arrayLengthMacro(a), lengthB = arrayLengthMacro(b), 
        max, *V_, *V, i, k, x, y;

    max = lengthA + lengthB;
    V_ = malloc(sizeof(int) * (max+1));
    if(V_ == NULL)
    {
        fprintf(stderr, "Failed to allocate memory for LCS");
        exit(1);
    }
    V = V_ + lengthA;
    V[1] = 0;

    for (D = 0; D < max; D++)
    {
        for (k = -D; k <= D; k = k + 2)
        {
            if ((k == -D && V[k-1] < V[k+1]) || (k != D && V[k-1] < V[k+1]))
            {
                x = V[k+1];
            }
            else
            {
                x = V[k-1] + 1;
            }
            y = x - k;
            while ((x < lengthB) && (y < lengthA) && (a[x+1] == b[y+1]))
            {
                x++;
                y++;
            }
            V[k] = x;
            if ((x >= lengthB) && (y >= lengthA))
            {
                return (lengthA + lengthB - D)/2;
            }
        }
    }
    return  (lengthA + lengthB - D)/2;
}

There are examples in the main. Eg. "tomato" and "potato" (don't comment), have an LCS length of 4. The implementation finds that it is 5 sice the SES (D in the code) comes out as 2 instead of 4 that I'd expect (delete "t", insert "p", delete "m", insert "t"). I am inclined to think that maybe the O(ND) algorithm counts substitutions as well, but I am not sure about this.

Any approach that is implementable (I don't have immense programming skills), is welcome. (If someone would know how to take advantage of the small alphabet for example).

EDIT: I think it would be useful to say on top of everything else, that I use the LCS function between ANY two strings at ANY time. So it is not only string say s1, compared with few million others. It might be s200 with s1000, then s0 with s10000, and then 250 with s100000. Not likely to be able to track most used strings either. I require that the LCS length is NOT an approximation, since I am implementing an approximation algorithm and I don't want to add extra error.

EDIT: Just ran callgrind. For an input of 10000 strings I seem to call the lcs function about 50,000,000 times, for different pairs of strings. (10000 strings is the lowest I want to run and the max is 1 million (if that is feasible)).

like image 757
Yiannis Avatar asked Jul 02 '11 08:07

Yiannis


2 Answers

I'd recommend getting hold of a copy of Gusfield's Algorithms on Strings, Trees and Sequences which is all about string operations for computational biology.

like image 119
borrible Avatar answered Oct 16 '22 17:10

borrible


I'm not familiar with the fancier-than-dynamic-programming algorithms for computing LCS, but I wanted to point out a few things:

First, the O(ND) approach only makes sense if you're comparing very large, very similar strings. This doesn't seem to be the case for you.

Second, speeding up the asymptotic performance of your LCD_Length function is probably not what you should be focusing on since your strings are pretty short. If you only care about finding similar or dissimilar pairs (and not all pairs' exact LCS), then the BK-tree mentioned by Yannick looks like a promising way to go.

Finally, some things bothered me about your DP implementation. The correct interpretation of "board[i][j]" in your code is "the longest subsequence of strings a[1..i], b[1..j]" (I'm using 1-indexing in this notation). Therefore, your for loops should include i = lengthA and j = lengthB. It looks like you hacked around this bug in your code by introducing arrayLengthMacro(a), but that hack doesn't make sense in the context of the algorithm. Once "board" is filled, you can look up the solution in board[lengthA][lengthB], which means you can get rid of the unnecessary "if (LCS < board[i][j])" block and return board[lengthA][lengthB]. Also, the loop bounds look wrong in the initialization (I'm pretty sure it should be for (i = 0; i <= maxLength; i++) where maxLength = max(lengthA, lengthB)).

like image 2
Julian Panetta Avatar answered Oct 16 '22 16:10

Julian Panetta