Similar names in a huge list

Tags: php, mysql, data-analysis

I have a database of 50,000+ companies that is constantly updated (200+ per month).

There is a huge issue with duplicated entries because the names are not always exact or correct:
"Super 1 Store"
"Super One Store"
"Super 1 Stores"

Edit: another example, which probably needs a different approach:
"Amy's Pizza" <---> "Organic Pizza by Amy and Company"

We need a tool to scan the data for similar names. I have some experience with Levenshtein distance and LCS, but they only work well for checking whether 2 given strings are similar.
Here I have to scan 50,000 names, maybe each-with-each, and calculate some overall similarity rating.

I need advice on how to attack this problem. The expected result is a list of 10-20 groups of very similar names, with the option to adjust the sensitivity later to get more results.

asked Nov 26 '13 08:11

d.raev

2 Answers

I had a similar problem a year or so ago and, if I remember correctly, I solved it (more or less) using similar_text and soundex, as other people suggested in the comments. Something like this:

<?php

$str1 = "Store 1 for you";
$str2 = "Store One 4 You";

// Compare the soundex codes of both names; $percent receives the similarity (0-100).
similar_text(soundex($str1), soundex($str2), $percent);

if ($percent >= 66) {
    echo "Equal";
    // Send an email for review
} else {
    echo "Different";
    // Proceed to insert into the database
}
?>

In my case I used a threshold of 66% to decide that two companies are the same (in that case I do not insert into the database but send an email to myself, so I can review it and check whether the match is correct).

After some months with this solution, I decided to use a unique code for the companies instead (the CIF in my case, because it is unique per company here in Spain).
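
The 66% check above could be applied to a whole list along these lines. This is only a minimal sketch: the function name `findSimilarNames`, its parameters, and the sample names are hypothetical and not part of my original code.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Returns the existing names whose soundex code is at least
// $threshold percent similar to the candidate's soundex code.
function findSimilarNames(string $candidate, array $existing, float $threshold = 66.0): array
{
    $candidateCode = soundex($candidate);
    $matches = [];
    foreach ($existing as $name) {
        // similar_text() writes the similarity percentage into $percent.
        similar_text($candidateCode, soundex($name), $percent);
        if ($percent >= $threshold) {
            $matches[$name] = round($percent, 1);
        }
    }
    arsort($matches); // most similar first, keys (names) preserved
    return $matches;
}

// Sample data (assumed for illustration).
$existing = ["Super One Store", "Super 1 Stores", "Amy's Pizza"];
print_r(findSimilarNames("Super 1 Store", $existing));
```

Keep in mind that PHP's soundex() returns a four-character code, so this compares fairly coarse phonetic keys, and checking every name against every other name this way still means one comparison per pair.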

answered Oct 22 '22 18:10

Sal00m

It depends purely on how much difference we are willing to tolerate before considering 2 strings similar. soundex can be useful as well:

select soundex('Super One Store');   -- returns S165236
select soundex('Super 1 Store');     -- returns S16236
select soundex('Super One Stores');  -- returns S1652362

The codes are almost the same (the "One" variants only add a 5: S165236 vs. S16236), so you can filter on both codes like below:

select * from (
    select 'Super One Store' as c
    union
    select 'Super 1 Store' as c
    union
    select 'Super One Stores' as c
    union
    select 'different one' as c
    union
    select 'supers stores' as c
) tmp
where soundex(c) like CONCAT('%', soundex('Super store'), '%')
   or soundex(c) like CONCAT('%', soundex('Super one store'), '%');
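
The same idea can be sketched in PHP as a way to avoid comparing each name with each other name: compute one soundex code per name and only look closer at names that share a code. This is a rough sketch under assumptions: the function name `groupBySoundex` and the digit-to-word replacement are illustrative additions, not something from the query above.

```php
<?php
// Hypothetical sketch: bucket names by soundex code so that only names
// sharing a code need a closer look (no each-with-each scan).
function groupBySoundex(array $names): array
{
    // Assumption: spelling out digits lets "1" and "One" produce the same
    // code, because soundex() ignores non-letter characters.
    $digits = [
        '0' => ' zero ', '1' => ' one ', '2' => ' two ', '3' => ' three ',
        '4' => ' four ', '5' => ' five ', '6' => ' six ', '7' => ' seven ',
        '8' => ' eight ', '9' => ' nine ',
    ];
    $groups = [];
    foreach ($names as $name) {
        $key = soundex(strtr($name, $digits));
        $groups[$key][] = $name;
    }
    // Keep only buckets holding more than one name: the duplicate candidates.
    return array_filter($groups, fn(array $group): bool => count($group) > 1);
}

$names = [
    'Super One Store', 'Super 1 Store', 'Super One Stores',
    'different one', 'supers stores',
];
print_r(groupBySoundex($names));
```

PHP's soundex() returns only a four-character code, while MySQL's SOUNDEX() returns longer codes like the ones shown above, so the PHP buckets are coarser. A stricter check such as levenshtein() or similar_text() inside each bucket can then confirm the real duplicates.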

answered Oct 22 '22 16:10

sumit
