 

Compare 5000 strings with PHP Levenshtein

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

phirschybar asked Dec 24 '09 11:12


3 Answers

I think a better way to group similar addresses would be to:

  1. create a database with two tables: one for the addresses (with an id), and one for the soundexes of the words or literal numbers in each address (with a foreign key to the addresses table)

  2. uppercase the address, replace anything other than [A-Z] or [0-9] with a space

  3. split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

  4. for each address (with id $target) find the most similar addresses:

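    SELECT similar.id, similar.address, COUNT(*) AS matches
    FROM address similar, word cmp, word src
    WHERE src.address_id = $target
    AND src.soundex = cmp.soundex
    AND cmp.address_id = similar.id
    AND similar.id <> $target            -- skip the source address itself
    GROUP BY similar.id, similar.address -- one row per candidate address
    ORDER BY matches DESC                -- most shared soundexes first
    LIMIT $some_value;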
    
  5. calculate the Levenshtein distance between your source address and the top few values returned by the query.

(doing any sort of operation on large arrays is often faster in databases)
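
Steps 2 and 3 above could be sketched in PHP roughly as follows. This is a minimal illustration, not the answer's own code: the function name `addressTokens` is hypothetical, and writing the tokens to the soundexes table is left out.

```php
<?php
// Hypothetical helper for steps 2 and 3; the name is an assumption.
function addressTokens(string $address): array
{
    // Step 2: uppercase, then replace anything outside [A-Z0-9] with a space.
    $clean = preg_replace('/[^A-Z0-9]+/', ' ', strtoupper($address));

    $tokens = [];
    // Step 3: split on spaces; keep digit-only tokens as they are,
    // and store the soundex of every other word.
    foreach (preg_split('/ +/', trim($clean), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        $tokens[] = ctype_digit($word) ? $word : soundex($word);
    }
    return $tokens;
}
```

With this sketch, "12 Bird Road, Apt 6" and "12 Bird Rd. #6" both produce the tokens 12, B630, R300 and 6 (the first also gets A130 for "Apt"), which is what lets the query in step 4 rank them as close candidates.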

symcbean answered Oct 13 '22 18:10


I think you cannot avoid looping through the array, as the levenshtein() function takes only two strings as input, not an array.

You can do something like:

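$n = count($array); // count once instead of on every iteration
for ($i = 0; $i < $n - 1; $i++)
{
    // start at $i + 1 so each pair is compared only once
    for ($j = $i + 1; $j < $n; $j++)
    {
        $lev = levenshtein($array[$i], $array[$j]);
        if ($lev == 0)
        {
            // exact match
        }
        else if ($lev <= THRESHOLD) // THRESHOLD is a constant you define yourself
        {
            // similar
        }
    }
}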
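
For reference, the same loop can be wrapped in a function that collects the matching pairs. This is only a sketch: the name `findSimilarPairs` and the shape of the returned array are assumptions, and the length check is an extra shortcut that is not part of the loop above. It relies on the fact that the Levenshtein distance between two strings is never smaller than the difference of their lengths.

```php
<?php
// Illustrative wrapper; function name and return shape are assumptions.
function findSimilarPairs(array $addresses, int $threshold): array
{
    $addresses = array_values($addresses); // make sure keys run 0..n-1
    $n = count($addresses);
    $pairs = [];
    for ($i = 0; $i < $n - 1; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            // Cheap pre-check: if the lengths differ by more than the
            // threshold, the distance must exceed it as well.
            if (abs(strlen($addresses[$i]) - strlen($addresses[$j])) > $threshold) {
                continue;
            }
            $lev = levenshtein($addresses[$i], $addresses[$j]);
            if ($lev <= $threshold) {
                $pairs[] = [$i, $j, $lev]; // index, index, distance (0 = exact match)
            }
        }
    }
    return $pairs;
}
```

This still makes on the order of n*(n-1)/2 comparisons in the worst case; the pre-check only avoids some of the levenshtein() calls.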

codaddict answered Oct 13 '22 19:10


You can use a BK-tree to speed up the search/comparison.

http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:

Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space.
[...]
Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitrary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
[...]
Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node!

Edit: this doesn't help with your normalization problem ("12 Bird Road, Apt 6" vs. "12 Bird Rd. #6"); it only speeds up the brute-force m*n comparison.
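
A minimal BK-tree along the lines of the quoted article might look like this in PHP. It is a sketch under assumptions, not the article's code: the class name `BkTree`, its method names and the node layout (`[term, children keyed by distance]`) are all made up for illustration.

```php
<?php
// Minimal BK-tree sketch; class, method names and node layout are assumptions.
final class BkTree
{
    /** Root node as [term, children keyed by distance to term], or null if empty. */
    private $root = null;

    public function add(string $term): void
    {
        if ($this->root === null) {
            $this->root = [$term, []];
            return;
        }
        $node = &$this->root;
        while (true) {
            $d = levenshtein($term, $node[0]);
            if ($d === 0) {
                return; // term is already stored
            }
            if (!isset($node[1][$d])) {
                $node[1][$d] = [$term, []]; // new child on the edge labelled $d
                return;
            }
            $node = &$node[1][$d]; // descend along the edge labelled $d
        }
    }

    /** Returns a list of ['term' => ..., 'distance' => ...] with distance <= $n. */
    public function search(string $query, int $n): array
    {
        $matches = [];
        $stack = $this->root === null ? [] : [$this->root];
        while ($stack) {
            [$term, $children] = array_pop($stack);
            $d = levenshtein($query, $term);
            if ($d <= $n) {
                $matches[] = ['term' => $term, 'distance' => $d];
            }
            // Triangle inequality: only edges in [d - n, d + n] can lead to matches.
            foreach ($children as $edge => $child) {
                if ($edge >= $d - $n && $edge <= $d + $n) {
                    $stack[] = $child;
                }
            }
        }
        return $matches;
    }
}
```

To find near-duplicates, each address would be added to the tree once and then looked up with a small maximum distance, for example `$tree->search($address, 2)`. How much of the tree a query visits depends on the data and the distance used.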

VolkerK answered Oct 13 '22 19:10