I have a database of 50,000+ companies that is constantly updated (200+ per month).
There is a huge issue with duplicate entries because the names are not always strict/correct:
"Super 1 Store"
"Super One Store"
"Super 1 Stores"
Edit: another example, which probably needs a different approach:
"Amy's Pizza" <---> "Organic Pizza by Amy and Company"
We need a tool to scan the data for similar names.
I have some experience with Levenshtein distance and LCS, but they only work well for checking whether two given strings are similar.
Here I have to scan 50,000 names, maybe comparing each with each, and calculate some overall similarity rating.
I need advice on how to attack this problem. The expected result is a list of 10-20 groups of very similar names, with the option to adjust the sensitivity to get more results.
I had a similar problem a year or so ago and, if I remember correctly, I solved it (more or less) using similar_text and soundex, as other people said in the comments. Something like this:
<?php
$str1 = "Store 1 for you";
$str2 = "Store One 4 You";

// Compare the Soundex codes of both names; $percent receives the similarity (0-100)
similar_text(soundex($str1), soundex($str2), $percent);

if ($percent >= 66) {
    echo "Equal";
    // Send an email for review
} else {
    echo "Different";
    // Proceed to insert in database
}
?>
In my case I use a threshold of 66% to decide that two companies are the same (in that case the record is not inserted into the database; instead an email is sent to me so I can review it and check whether the match is correct).
After some months with this solution, I decided to use a unique code for the companies instead (the CIF in my case, because it is unique per company here in Spain).
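The same threshold idea can be applied to a new name against a list of existing names. Below is a minimal sketch in Python, not the code I actually used: it swaps similar_text/soundex for `difflib.SequenceMatcher` from the standard library, and the names `NUMBER_WORDS`, `normalize`, `similarity` and `needs_review` are hypothetical helpers made up for this example. The 66% threshold is just the value mentioned above and would need tuning on real data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mapping so that "One" and "1" compare as equal.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def normalize(name):
    """Lowercase, drop punctuation, and replace number words with digits."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(NUMBER_WORDS.get(w, w) for w in words)

def similarity(a, b):
    """Similarity of two names as a percentage (0-100) after normalization."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def needs_review(new_name, existing_names, threshold=66.0):
    """Return the existing names that look like duplicates of new_name."""
    return [n for n in existing_names if similarity(new_name, n) >= threshold]

if __name__ == "__main__":
    existing = ["Super One Store", "Super 1 Stores", "Different One"]
    matches = needs_review("Super 1 Store", existing)
    if matches:
        print("Send for review:", matches)
    else:
        print("Insert into database")
```

This only catches spelling-level variants. A pair like "Amy's Pizza" and "Organic Pizza by Amy and Company" is unlikely to pass a character-level threshold, which is one reason a unique company code ended up being more reliable for me.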
It purely depends on how much difference we are willing to tolerate before considering two strings similar. soundex can be useful as well:
select soundex('Super One Store'); returns S165236
select soundex('Super 1 Store'); returns S16236
select soundex('Super One Stores'); returns S1652362
The three codes are nearly identical: the 'One' variants only add a 5 (for the 'n'), and the plural adds a trailing 2. So you can use a filter like the one below:
select * from (
    select 'Super One Store' as c
    union
    select 'Super 1 Store' as c
    union
    select 'Super One Stores' as c
    union
    select 'different one' as c
    union
    select 'supers stores' as c
) tmp
where soundex(c) like CONCAT('%', soundex('Super store'), '%')
   or soundex(c) like CONCAT('%', soundex('Super one store'), '%');
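To avoid comparing each name with each other name, the codes can also be used as a coarse bucketing key. Here is a rough sketch in Python under some assumptions: `soundex_key` is a simplified, uncapped Soundex-style function written for this example (it reproduces the three codes shown above, but it is not guaranteed to match MySQL's SOUNDEX for every input), and `group_by_prefix` with its `prefix_len` parameter is a hypothetical helper. A shorter prefix gives larger, looser groups, so it acts as a sensitivity knob.

```python
from collections import defaultdict

# Consonant -> digit table used by Soundex-style codes.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        CODES[ch] = digit

def soundex_key(name):
    """Simplified Soundex-style key with no length cap (assumption: ASCII names)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code          # new consonant sound
        if ch not in "hw":       # h and w do not separate repeated codes
            prev = code
    return key

def group_by_prefix(names, prefix_len=3):
    """Bucket names by the first prefix_len characters of their key.

    Only buckets with more than one name are returned, since those are
    the candidate duplicate groups worth a closer look.
    """
    buckets = defaultdict(list)
    for name in names:
        buckets[soundex_key(name)[:prefix_len]].append(name)
    return {k: v for k, v in buckets.items() if len(v) > 1}

if __name__ == "__main__":
    names = ["Super One Store", "Super 1 Store", "Super One Stores",
             "different one", "supers stores"]
    for key, group in group_by_prefix(names).items():
        print(key, group)
```

Within each bucket you can then run a stricter pairwise check (Levenshtein, similar_text, etc.) on far fewer pairs than the full each-with-each scan.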