Algorithms: Interesting diffing algorithm

Question

This came up in a real-world situation, and I thought I would share it, as it could lead to some interesting solutions. Essentially, the algorithm needs to diff two lists, but let me give you a more rigorous definition of the problem.

Mathematical Formulation

Suppose you have two lists, L and R each of which contain elements from some underlying alphabet S. Moreover, these lists have the property that the common elements that they have appear in order: that is to say, if L[i] = R[i*] and L[j] = R[j*], and i < j then i* < j*. The lists need not have any common elements at all, and one or both may be empty. [Clarification: You may assume no repetitions of elements.]

The problem is to produce a sort of "diff" of the lists, which may be viewed as new list of ordered pairs (x,y) where x is from L and y is from R, with the following properties:

If x appears in both lists, then (x,x) appears in the result.
If x appears in L, but not in R, then (x,NULL) appears in the result.
If y appears in R, but not in L, then (NULL,y) appears in the result.

and finally

The result list has "the same" ordering as each of the input lists: it shares, roughly speaking, the same ordering property as above with each of the lists individually (see example).

Examples

L = (d)
R = (a,b,c)
Result = ((NULL,d), (a,NULL), (b,NULL), (c,NULL))

L = (a,b,c,d,e)  
R = (b,q,c,d,g,e)
Result = ((a,NULL), (b,b), (NULL,q), (c,c), (d,d), (NULL,g), (e,e))

Does anyone have any good algorithms to solve this? What is the complexity?

Mark Ransom · Accepted Answer

There is a way to do this in O(n), if you're willing to make a copy of one of the lists in a different data structure. This is a classic time/space tradeoff.

Create a hash map of the list R, with the key being the element and the value being the original index into the array; in C++, you could use unordered_map from tr1 or boost.

Keep an index to the unprocessed portion of list R, initialized to the first element.

For each element in list L, check the hash map for a match in list R. If you do not find one, output (L value, NULL). If there is a match, get the corresponding index from the hash map. For each unprocessed element in list R up to the matching index, output (NULL, R value). For the match, output (value, value).

When you have reached the end of list L, go through the remaining elements of list R and output (NULL, R value).

Edit: Here is the solution in Python. To those who say this solution depends on the existence of a good hashing function - of course it does. The original poster may add additional constraints to the question if this is a problem, but I will take an optimistic stance until then.

def FindMatches(listL, listR):
    result=[]
    lookupR={}
    for i in range(0, len(listR)):
        lookupR[listR[i]] = i
    unprocessedR = 0
    for left in listL:
        if left in lookupR:
            for right in listR[unprocessedR:lookupR[left]]:
                result.append((None,right))
            result.append((left,left))
            unprocessedR = lookupR[left] + 1
        else:
            result.append((left,None))
    for right in listR[unprocessedR:]:
        result.append((None,right))
    return result

>>> FindMatches(('d'),('a','b','c'))
[('d', None), (None, 'a'), (None, 'b'), (None, 'c')]
>>> FindMatches(('a','b','c','d','e'),('b','q','c','d','g','e'))
[('a', None), ('b', 'b'), (None, 'q'), ('c', 'c'), ('d', 'd'), (None, 'g'), ('e','e')]

Brian · Answer

The worst case, as defined and using only equality, must be O(n*m). Consider the following two lists:

A[] = {a,b,c,d,e,f,g}

B[] = {h,i,j,k,l,m,n}

Assume there exists exactly one match between those two "ordered" lists. It will take O(n*m) comparisons since there does not exist a comparison which removes the need for other comparisons later.

So, any algorithm you come up with is going to be O(n*m), or worse.

Mike Pone · Answer

Diffing ordered lists can be done in linear time by traversing both lists and matching as you go. I will try to post some psuedo Java code in an update.

Since we don't know the ordering algorithm and can't determine any ordering based on less than or greater than operators, we must consider the lists unordered. Also, given how the results are to be formatted you are faced with scanning both lists (at least until you find a match and then you can bookmark and start from there again). It will still be O(n^2) performance, or yes more specifically O(nm).

Nixuz · Answer

This is exactly like sequence alignment, you can use the Needleman-Wunsch algorithm to solve it. The link includes the code in Python. Just make sure you set the scoring so that a mismatch is negative and a match is positive and an alignment with a blank is 0 when maximizing. The algorithm runs in O(n * m) time and space, but the space complexity of this can be improved.

Scoring Function

int score(char x, char y){
    if ((x == ' ') || (y == ' ')){
        return 0;
    }
    else if (x != y){
        return -1;
    }
    else if (x == y){
        return 1;
    }
    else{
        puts("Error!");
        exit(2);
    }
}

Code

#include <stdio.h>
#include <stdbool.h>

int max(int a, int b, int c){
    bool ab, ac, bc;
    ab = (a > b);
    ac = (a > c);
    bc = (b > c);
    if (ab && ac){
        return a;
    }
    if (!ab && bc){
        return b;
    }
    if (!ac && !bc){
        return c;
    }
}

int score(char x, char y){
    if ((x == ' ') || (y == ' ')){
        return 0;
    }
    else if (x != y){
        return -1;
    }
    else if (x == y){
        return 1;
    }
    else{
        puts("Error!");
        exit(2);
    }
}


void print_table(int **table, char str1[], char str2[]){
    unsigned int i, j, len1, len2;
    len1 = strlen(str1) + 1;
    len2 = strlen(str2) + 1;
    for (j = 0; j < len2; j++){
        if (j != 0){
            printf("%3c", str2[j - 1]);
        }
        else{
            printf("%3c%3c", ' ', ' ');
        }
    }
    putchar('
');
    for (i = 0; i < len1; i++){
        if (i != 0){
            printf("%3c", str1[i - 1]);
        }
        else{
            printf("%3c", ' ');
        }
        for (j = 0; j < len2; j++){
            printf("%3d", table[i][j]);
        }
        putchar('
');
    }
}

int **optimal_global_alignment_table(char str1[], char str2[]){
    unsigned int len1, len2, i, j;
    int **table;
    len1 = strlen(str1) + 1;
    len2 = strlen(str2) + 1;
    table = malloc(sizeof(int*) * len1);
    for (i = 0; i < len1; i++){
        table[i] = calloc(len2, sizeof(int));
    }
    for (i = 0; i < len1; i++){
        table[i][0] += i * score(str1[i], ' ');
    }
    for (j = 0; j < len1; j++){
        table[0][j] += j * score(str1[j], ' ');
    }
    for (i = 1; i < len1; i++){
        for (j = 1; j < len2; j++){
            table[i][j] = max(
                table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]),
                table[i - 1][j] + score(str1[i - 1], ' '),
                table[i][j - 1] + score(' ', str2[j - 1])
            );
        }
    }
    return table;
}

void prefix_char(char ch, char str[]){
    int i;
    for (i = strlen(str); i >= 0; i--){
        str[i+1] = str[i];
    }   
    str[0] = ch;
}

void optimal_global_alignment(int **table, char str1[], char str2[]){
    unsigned int i, j;
    char *align1, *align2;
    i = strlen(str1);
    j = strlen(str2);
    align1 = malloc(sizeof(char) * (i * j));
    align2 = malloc(sizeof(char) * (i * j));
    align1[0] = align2[0] = '\0';
    while((i > 0) && (j > 0)){
        if (table[i][j] == (table[i - 1][j - 1] + score(str1[i - 1], str2[j - 1]))){
            prefix_char(str1[i - 1], align1);
            prefix_char(str2[j - 1], align2);
            i--;
            j--;
        }
        else if (table[i][j] == (table[i - 1][j] + score(str1[i-1], ' '))){
            prefix_char(str1[i - 1], align1);
            prefix_char('_', align2);
            i--;
        }
        else if (table[i][j] == (table[i][j - 1] + score(' ', str2[j - 1]))){
            prefix_char('_', align1);
            prefix_char(str2[j - 1], align2);
            j--;
        }
    }
    while (i > 0){
        prefix_char(str1[i - 1], align1);
        prefix_char('_', align2);
        i--;
    }
    while(j > 0){
        prefix_char('_', align1);
        prefix_char(str2[j - 1], align2);
        j--;
    }
    puts(align1);
    puts(align2);
}

int main(int argc, char * argv[]){
    int **table;
    if (argc == 3){
        table = optimal_global_alignment_table(argv[1], argv[2]);
        print_table(table, argv[1], argv[2]);
        optimal_global_alignment(table, argv[1], argv[2]);
    }
    else{
        puts("Reqires to string arguments!");
    }
    return 0;
}

Sample IO

$ cc dynamic_programming.c && ./a.out aab bba
__aab
bb_a_
$ cc dynamic_programming.c && ./a.out d abc
___d
abc_
$ cc dynamic_programming.c && ./a.out abcde bqcdge
ab_cd_e
_bqcdge

Algorithms: Interesting diffing algorithm

Tags:

language-agnostic

algorithm

list

diff

Mathematical Formulation

Examples

Jake

4 Answers

Mark Ransom

Brian

Mike Pone

Nixuz

Recent Activity

Donate For Us

Algorithms: Interesting diffing algorithm

Tags:

language-agnostic

algorithm

list

diff

Mathematical Formulation

Examples

Jake

4 Answers

Mark Ransom

Brian

Mike Pone

Nixuz

Related questions

Recent Activity

Donate For Us