
Similar names in a huge list

I have a database of 50,000+ companies that is constantly updated (200+ per month).

There is a huge issue with duplicated content because the names are not always entered strictly/correctly:
"Super 1 Store"
"Super One Store"
"Super 1 Stores"

Edit: another example, which probably needs a different approach:
"Amy's Pizza" <---> "Organic Pizza by Amy and Company"

We need a tool to scan the data for similar names. I have some experience with Levenshtein distance and LCS, but they only work nicely for checking whether 2 given strings are similar.
Here I have to scan 50,000 names, maybe each against each, and calculate some overall similarity rating.

I need advice on how to attack this problem. The expected result is a list of 10-20 groups of very similar names, with the option to adjust the sensitivity later to get more results.

d.raev asked Nov 26 '13 08:11


2 Answers

I had a similar problem a year or so ago and, if I remember well, I solved it (more or less) using similar_text and soundex, as other people said in the comments. Something like this:

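<?php

$str1 = "Store 1 for you";
$str2 = "Store One 4 You";

// soundex() gives a 4-character phonetic key; similar_text() writes the
// percentage similarity of the two keys into $percent.
similar_text(soundex($str1), soundex($str2), $percent);

if ($percent >= 66) {
    echo "Equal";
    // Send an email for review
} else {
    echo "Different";
    // Proceed to insert in database
}
?>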

In my case I use a threshold of 66% to decide that two companies are the same (in that case I do not insert into the database but send an email to myself to review and check whether the match is correct).

After some months with this solution, I decided to use a unique code for the companies (the CIF in my case, because it is unique per company here in Spain).
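
For a one-off scan of the whole list, comparing each name with each other name means about 1.25 billion comparisons for 50,000 names. A rough sketch of one way to cut that down, written in Python for brevity: normalise each name, put names into buckets by a cheap key, and apply the threshold check only inside each bucket. Everything here is an assumption for illustration: the `NUMBER_WORDS` table, the plural stripping, the three-character bucket key, and the use of `difflib.SequenceMatcher` as a stand-in for `similar_text` (the two do not score identically, so the 0.66 threshold would need tuning).

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical digit-to-word table; extend it for your own data.
NUMBER_WORDS = {"1": "one", "2": "two", "4": "four"}


def normalize(name):
    """Lower-case, drop punctuation, spell out known digits, strip a trailing 's'."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens]
    tokens = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return " ".join(tokens)


def similar_pairs(names, threshold=0.66):
    """Return (name_a, name_b, ratio) for pairs in the same bucket scoring >= threshold."""
    buckets = defaultdict(list)
    for name in names:
        # Bucket key (an assumption): first three characters of the normalised name.
        buckets[normalize(name)[:3]].append(name)

    pairs = []
    for bucket in buckets.values():
        # Only names sharing a bucket are compared with each other.
        for a, b in combinations(bucket, 2):
            ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if ratio >= threshold:
                pairs.append((a, b, ratio))
    return pairs


names = [
    "Super 1 Store",
    "Super One Store",
    "Super 1 Stores",
    "Supply Depot",
    "Different One",
]
for a, b, ratio in similar_pairs(names):
    print(f"{a!r} ~ {b!r}: {ratio:.2f}")
```

The three "Super ... Store" variants normalise to the same string and are reported as pairs, while "Supply Depot" shares their bucket but stays below the threshold. A prefix-based bucket will not catch cases like "Amy's Pizza" versus "Organic Pizza by Amy and Company"; those would probably need a token-based key instead.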

Sal00m answered Oct 22 '22 18:10



It depends purely on how much difference we should tolerate before considering 2 strings similar. soundex can be useful as well:

select soundex('Super One Store');   -- returns S165236
select soundex('Super 1 Store');     -- returns S16236
select soundex('Super One Stores');  -- returns S1652362

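The codes are nearly identical: S16236 is the code for 'Super 1 Store' (and for 'Super store'), and the 'One' variants only add a 5. You can filter on both patterns like below: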

select * from (
    select 'Super One Store' as c
    union
    select 'Super 1 Store' as c
    union
    select 'Super One Stores' as c
    union
    select 'different one' as c
    union
    select 'supers stores' as c
) tmp
where soundex(c) like CONCAT('%', soundex('Super store'), '%')
   or soundex(c) like CONCAT('%', soundex('Super one store'), '%')
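
To try the same filter outside the database, here is a minimal Python sketch. `long_soundex` is a hypothetical helper, not a library function: it is a simplified, untruncated Soundex-style code that reproduces the three values quoted above, and it is an assumption that it agrees with MySQL's SOUNDEX on other inputs. `matches` mirrors the two `like '%...%'` conditions as substring tests.

```python
# Standard Soundex digit groups.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def long_soundex(text):
    """Untruncated Soundex-style code (simplified; hypothetical helper).

    Non-letters, vowels, H, W and Y are dropped first, then adjacent
    duplicate digits are collapsed. The result is padded to 4 characters.
    """
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    digits = []
    for ch in letters:
        code = CODES.get(ch)
        # Skip uncoded letters and digits that repeat the previous one.
        if code is not None and (not digits or digits[-1] != code):
            digits.append(code)
    # The first letter is kept as a letter, so drop its own digit if it has one.
    if letters[0] in CODES:
        digits = digits[1:]
    return (letters[0] + "".join(digits)).ljust(4, "0")


def matches(name, probes):
    """True if the code of any probe occurs inside the code of name."""
    code = long_soundex(name)
    return any(long_soundex(p) in code for p in probes)


candidates = ["Super One Store", "Super 1 Store", "Super One Stores",
              "different one", "supers stores"]
probes = ["Super store", "Super one store"]
for name in candidates:
    print(name, long_soundex(name), matches(name, probes))
```

With these inputs every candidate except 'different one' matches one of the two probes.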
sumit answered Oct 22 '22 16:10