
Levenshtein distance: how to better handle words swapping positions?

I've had some success comparing strings using the PHP levenshtein function.

However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings.

For example:

levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences 

are treated as having less in common than:

levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences 

I'd prefer an algorithm which saw that the first two were more similar.

How could I go about coming up with a comparison function that can identify substrings which have switched position as being distinct from edits?

One possible approach I've thought of is to put all the words in the string into alphabetical order before the comparison. That takes the original order of the words completely out of the comparison. A downside to this, however, is that changing just the first letter of a word can create a much bigger disruption than changing a single letter should cause.
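A quick sketch of that idea in PHP, using the built-in levenshtein() (the sortedLevenshtein() wrapper name and the lowercasing choice are just illustrative):

<?php
// Sketch: sort the words alphabetically before comparing, so that
// word order no longer contributes to the edit distance.
function sortedLevenshtein(string $a, string $b): int
{
    $wordsA = explode(' ', strtolower($a));
    $wordsB = explode(' ', strtolower($b));
    sort($wordsA);
    sort($wordsB);
    return levenshtein(implode(' ', $wordsA), implode(' ', $wordsB));
}

echo sortedLevenshtein("The quick brown fox", "brown quick The fox"); // 0
?>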

What I'm trying to achieve is to compare two facts about people which are free text strings, and decide how likely these facts are to indicate the same fact. The facts might be the school someone attended, or the name of their employer or publisher, for example. Two records may have the same school spelled differently, words in a different order, extra words, etc, so the matching has to be somewhat fuzzy if we are to make a good guess that they refer to the same school. So far it is working very well for spelling errors (I am using a phonetic algorithm similar to metaphone on top of all this) but very poorly if you switch the order of words around, which seems common in school names: "xxx college" vs "college of xxx".

asked May 06 '09 by thomasrutter



1 Answer

N-grams

Use N-grams, which handle multiple-character transpositions across the whole text.

The general idea is that you split the two strings in question into all the possible 2-3 character substrings (n-grams) and treat the number of shared n-grams between the two strings as their similarity metric. This can then be normalized by dividing the shared count by the total number of n-grams in the longer string. This is trivial to calculate, but fairly powerful.

For the example sentences:

A. The quick brown fox
B. brown quick The fox
C. The quiet swine flu

Out of 20 total possible 2-grams, A and B share 18, while A and C share only 8.
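A minimal PHP sketch of the n-gram comparison, assuming 2-grams and normalization by the longer string's n-gram count (the helper names ngrams() and ngramSimilarity() are mine, not built-ins; array_intersect() only approximates a strict multiset intersection, so its counts may differ slightly from those above):

<?php
// Split a string into overlapping substrings of length $n.
function ngrams(string $s, int $n = 2): array
{
    $grams = [];
    for ($i = 0; $i <= strlen($s) - $n; $i++) {
        $grams[] = substr($s, $i, $n);
    }
    return $grams;
}

// Similarity = shared n-grams / n-gram count of the longer string.
function ngramSimilarity(string $a, string $b, int $n = 2): float
{
    $gramsA = ngrams($a, $n);
    $gramsB = ngrams($b, $n);
    // array_intersect() keeps every entry of $gramsA that also occurs
    // somewhere in $gramsB -- an approximation of a multiset intersection.
    $shared = count(array_intersect($gramsA, $gramsB));
    return $shared / max(count($gramsA), count($gramsB));
}

echo ngramSimilarity("The quick brown fox", "brown quick The fox"), "\n"; // high
echo ngramSimilarity("The quick brown fox", "The quiet swine flu"), "\n"; // lower
?>

Raising $n makes matching stricter; 2-3 character n-grams are a reasonable balance for short strings like names.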

This has been discussed in more detail in the Gravano et al. paper.

tf-idf and cosine similarity

A less trivial alternative, grounded in information theory, would be to use term frequency–inverse document frequency (tf-idf) to weight the tokens, construct sentence vectors, and then use cosine similarity as the similarity metric.

The algorithm is:

  1. Calculate 2-character token frequencies (tf) per sentence.
  2. Calculate inverse sentence frequencies (idf): idf(t) = log(N / n_t), where N is the number of sentences in the corpus (in this case 3) and n_t is the number of sentences in which token t appears. In this case "th" is in all sentences, so it has zero information content (log(3/3) = 0).
  3. Produce the tf-idf matrix by multiplying corresponding cells in the tf and idf tables.
  4. Finally, calculate the cosine similarity matrix for all sentence pairs: cos(θ) = (A · B) / (‖A‖ ‖B‖), where A and B are the tf-idf weight vectors of the two sentences. The range is from 0 (not similar) to 1 (equal).
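Here is a minimal sketch of those four steps in PHP, assuming character 2-grams as tokens and a natural-log idf (the function names bigrams(), tfidfVectors() and cosine() are illustrative, not built-ins):

<?php
// Step 1: term frequencies of character 2-grams per sentence.
function bigrams(string $s): array
{
    $grams = [];
    $s = strtolower($s);
    for ($i = 0; $i < strlen($s) - 1; $i++) {
        $g = substr($s, $i, 2);
        $grams[$g] = ($grams[$g] ?? 0) + 1;
    }
    return $grams;
}

// Steps 2-3: sentence frequencies, then tf * idf per cell.
function tfidfVectors(array $sentences): array
{
    $tfs = array_map('bigrams', $sentences);
    $n = count($sentences);

    // In how many sentences does each token appear?
    $df = [];
    foreach ($tfs as $tf) {
        foreach ($tf as $gram => $_) {
            $df[$gram] = ($df[$gram] ?? 0) + 1;
        }
    }

    // Weight each cell by idf(t) = log(N / n_t); tokens present in
    // every sentence get weight 0, as in the "th" example above.
    $vectors = [];
    foreach ($tfs as $i => $tf) {
        foreach ($tf as $gram => $count) {
            $vectors[$i][$gram] = $count * log($n / $df[$gram]);
        }
    }
    return $vectors;
}

// Step 4: cosine similarity between two sparse weight vectors.
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $gram => $w) {
        $dot += $w * ($b[$gram] ?? 0.0);
    }
    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));
    return ($normA && $normB) ? $dot / ($normA * $normB) : 0.0;
}

$vecs = tfidfVectors([
    'The quick brown fox',
    'brown quick The fox',
    'The quiet swine flu',
]);
printf("A~B: %.3f\n", cosine($vecs[0], $vecs[1])); // close to 1
printf("A~C: %.3f\n", cosine($vecs[0], $vecs[2])); // close to 0
?>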

Levenshtein modifications and Metaphone

Regarding other answers: the Damerau–Levenshtein modification supports only the transposition of two adjacent characters, and Metaphone was designed to match words that sound the same, not for similarity matching.

answered Oct 02 '22 by Tomasz