Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.

The Wikipedia article on edit distance gives some good background on the concept.

By taking "chunk transposition" into account, I mean that

Turing, Alan.

should match

Alan Turing

more closely than it matches

Turing Machine

I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.

The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).

like image 535
Steven Huwig Avatar asked May 18 '09 14:05

Steven Huwig


People also ask

How does edit distance algorithm work?

In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.

Is edit distance and Levenshtein distance same?

The Levenshtein distance (a.k.a edit distance) is a measure of similarity between two strings. It is defined as the minimum number of changes required to convert string a into string b (this is done by inserting, deleting or replacing a character in string a ).

Is edit distance NP complete?

The main results presented here are: The problem of edit distance with moves is NP-complete. A “recursive” sequence of moves can be simulated with at most a constant factor increase by a non-recursive sequence.

How do you calculate edit distance?

Delete 'm'th character of str1 and compute edit distance between 'm-1' characters of str1 and 'n' characters of str2. For this computation, we simply have to do - (1 + array[m-1][n]) where 1 is the cost of delete operation and array[m-1][n] is edit distance between 'm-1' characters of str1 and 'n' characters of str2.


1 Answers

In the case of your application you should probably think about adapting some algorithms from bioinformatics.

For example you could firstly unify your strings by making sure, that all separators are spaces or anything else you like, such that you would compare "Alan Turing" with "Turing Alan". And then split one of the strings and do an exact string matching algorithm ( like the Horspool-Algorithm ) with the pieces against the other string, counting the number of matching substrings.

If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.

Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.

Well, that was a rather abstract answer, but I hope it points you in the right direction, but sadly it doesn't provide you with a simple formula to solve your problem.

like image 142
Paul Avatar answered Oct 21 '22 05:10

Paul