Is there an edit distance algorithm that takes "chunk transposition" into account?

Tags: language-agnostic, algorithm, levenshtein-distance, edit-distance

I put "chunk transposition" in quotes because I don't know whether there is a technical term for this, or what it would be. Just knowing the technical term for the process, if one exists, would be very helpful.

The Wikipedia article on edit distance gives some good background on the concept.

By taking "chunk transposition" into account, I mean that

Turing, Alan.

should match

Alan Turing

more closely than it matches

Turing Machine

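That is, the distance calculation should detect when substrings of the text have simply been moved within the text. The common Levenshtein distance does not do this: it counts only single-character insertions, deletions, and substitutions, so a moved substring is charged as if it had been deleted in one place and retyped in another.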
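To make the problem concrete, here is a minimal sketch of the standard dynamic-programming Levenshtein distance. The function name `levenshtein` is just an illustrative choice, and the comparison is case-sensitive with no normalization of punctuation. On the example strings above, it rates "Turing Machine" as closer to "Turing, Alan." than "Alan Turing" is.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


query = "Turing, Alan."
for candidate in ("Alan Turing", "Turing Machine"):
    print(candidate, levenshtein(query, candidate))
```

The moved chunk "Alan" cannot be aligned with its counterpart without breaking the alignment of "Turing", which is why the reordered name ends up with the larger distance.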

The strings will be a few hundred characters long at most -- they are author names or lists of author names, which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people who do will know a bit about this subject).

535

asked May 18 '09 14:05

Steven Huwig

1 Answer

For your application, you should probably consider adapting some algorithms from bioinformatics.

For example, you could first normalize your strings by making sure that all separators are spaces (or any other single character you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings into pieces and run an exact string matching algorithm (like the Horspool algorithm) with each piece against the other string, counting the number of matching substrings.
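The normalize, split, and match steps could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names `normalize`, `horspool_find`, and `matching_pieces` are made up here, every run of non-word characters is treated as a separator, and matching is case-sensitive.

```python
import re


def normalize(text):
    """Replace every run of separator characters with a single space."""
    return re.sub(r"[\W_]+", " ", text).strip()


def horspool_find(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Bad-character table: for each character of the pattern except the last,
    # the distance from its rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the table entry for the text character under the last
        # pattern position; unseen characters allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return -1


def matching_pieces(a, b):
    """Count the pieces of a that occur verbatim somewhere in b."""
    target = normalize(b)
    return sum(1 for piece in normalize(a).split()
               if horspool_find(piece, target) != -1)


print(matching_pieces("Turing, Alan.", "Alan Turing"))     # both pieces found
print(matching_pieces("Turing, Alan.", "Turing Machine"))  # only "Turing" found
```

Note that this counts substring hits, so a piece like "Alan" would also match inside a longer word; splitting both strings and comparing whole pieces would be stricter.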

If you would like to find matches that are merely similar rather than equal, something along the lines of a local alignment might be more suitable, since it provides a score that describes the similarity. That said, the Smith-Waterman algorithm is probably overkill for your application, and it is not even the best local alignment algorithm available.
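As a rough sketch of what a local alignment score looks like, here is a small Smith-Waterman-style scorer applied piece by piece. The scoring values (match +2, mismatch -1, linear gap -1) and the names `local_alignment_score` and `piecewise_score` are assumptions chosen for illustration, not tuned recommendations.

```python
import re


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A cell never drops below 0, so an alignment can start anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def piecewise_score(a, b):
    """Sum the best local alignment score of each word of a against b."""
    return sum(local_alignment_score(piece, b)
               for piece in re.findall(r"\w+", a))


print(piecewise_score("Turing, Alan.", "Alan Turing"))
print(piecewise_score("Turing, Alan.", "Turing Machine"))
```

Because each piece is aligned independently, the order of the pieces in the other string does not matter, and a slightly misspelled piece still earns most of its score.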

Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.

That was a rather abstract answer, and sadly it doesn't give you a simple formula to solve your problem, but I hope it points you in the right direction.

142

answered Oct 21 '22 05:10

Paul
