I have a MySQL database table containing information on around 1000 shops. Now I will be importing more shops by uploading an Excel spreadsheet, and I am trying to avoid duplicates.
But here is my problem.
Currently I'm importing the data to a temporary table. Now I'm wondering what is the best approach for comparing the imported shops with the ones already existing.
My plan is to go through each row and compare the shops.
Does anyone have experience with this sort of data comparison?
Update
Thanks for good answers.
Fields that will be used for comparison are:
I'm thinking something along these lines:
Select rows where the name is within a small Levenshtein distance and the country matches.
That way I only have to work with a small list.
Then I can start comparing name and address more thoroughly.
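The narrowing step described above could be sketched in Python roughly like this. It is only an illustration under assumptions: shops are plain dicts with hypothetical "name" and "country" keys (as if already read from the temporary and main tables), and max_distance is an arbitrary threshold that would need tuning against the real data.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def candidate_pairs(imported, existing, max_distance=2):
    """Yield (imported_shop, existing_shop) pairs that deserve a closer look.

    Assumption: each shop is a dict with 'name' and 'country' keys.
    """
    # Group existing shops by country so each imported shop is only
    # compared against shops in the same country.
    by_country = defaultdict(list)
    for shop in existing:
        by_country[shop["country"]].append(shop)

    for new_shop in imported:
        for old_shop in by_country.get(new_shop["country"], []):
            distance = levenshtein(new_shop["name"].lower(),
                                   old_shop["name"].lower())
            if distance <= max_distance:
                yield new_shop, old_shop
```

The pairs this yields are only candidates. The more thorough name and address comparison would then run on this much smaller list.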
Levenshtein distance is the way to do it, and you can avoid manual input. But the actual implementation will depend on some prior knowledge about the data, such as how much error you expect in the spellings.
Suppose, for example, that the data is of good quality and you only expect typos. You can then build a matching condition from: 1) is the number of words the same? 2) are the words in the same sequence? 3) is the Levenshtein distance for each word in the name within a small allowed threshold?
The condition can be reinforced by checking the address with a similar condition when the name is ambiguous, or vice versa.
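As a rough sketch of the three conditions above, assuming typo-level errors only: the function names, the "name" and "address" dict keys, and the per-word threshold of 1 are all illustrative choices, not something taken from the question. Requiring both name and address to match is one simple reading of "reinforcing" the condition.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def words_match(text_a: str, text_b: str, max_word_errors: int = 1) -> bool:
    """Apply the three checks: same word count, same order, small per-word distance."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # 1) The number of words must be the same.
    if len(words_a) != len(words_b):
        return False
    # 2) and 3) Compare words position by position, so the order must agree,
    # and allow only a small edit distance for each pair.
    return all(levenshtein(wa, wb) <= max_word_errors
               for wa, wb in zip(words_a, words_b))


def is_probable_duplicate(shop_a: dict, shop_b: dict) -> bool:
    """Assumption: shops are dicts with 'name' and 'address' keys.

    The address check backs up the name check, so a fuzzy name match
    alone is not enough to flag a duplicate.
    """
    return (words_match(shop_a["name"], shop_b["name"])
            and words_match(shop_a["address"], shop_b["address"]))
```

For example, "Joes Coffee House" and "Joe's Coffee House" pass the name check, while "Coffee House" and "House Coffee" fail it because the word order differs.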