What is the best way to compare data when importing to database?

I have a MySQL database table containing information on around 1000 shops. I will now be importing more shops by uploading an Excel spreadsheet, and I am trying to avoid duplicates.

  • Shops may have the same name, but never the same address.
  • Shops may have the same address, but never the same name.

But here is my problem.

  • Shop names may be misspelled.
  • Addresses may be misspelled.

Currently I'm importing the data into a temporary table. Now I'm wondering what the best approach is for comparing the imported shops with the ones that already exist.

My plan is to go through each row and compare the shops.

  • First compare a.name = b.name AND a.street = b.street. On match, shop is deleted.
  • Then I will do a Levenshtein comparison on name and street. Here I probably will have to manually look at the results to determine if it's a duplicate.
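A minimal Python sketch of this two-pass plan, assuming each shop is a dict with `name` and `street` keys. The `classify` helper, the dict layout, and the `max_distance=2` threshold are illustrative assumptions, not part of the original setup:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(imported, existing, max_distance=2):
    """Return 'duplicate', 'review' or 'new' for one imported shop.

    max_distance is an assumed threshold; tune it against real data.
    """
    # Pass 1: exact match on name and street means a certain duplicate.
    for shop in existing:
        if imported["name"] == shop["name"] and imported["street"] == shop["street"]:
            return "duplicate"
    # Pass 2: near match on both fields means flag for manual review.
    for shop in existing:
        if (levenshtein(imported["name"], shop["name"]) <= max_distance
                and levenshtein(imported["street"], shop["street"]) <= max_distance):
            return "review"
    return "new"
```

Rows classified as `review` are the ones that would still need a manual look.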

Does anyone have experience with this sort of data comparison?

Update
Thanks for the good answers.

Fields that will be used for comparison are:

  • name
  • street address
  • zip code
  • city
  • country

I'm thinking something along these lines:

Select rows where the name is within a small Levenshtein distance of the imported name and the country is the same.
That way I only have to work with a small list.

Then I can start comparing name and address more thoroughly.
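The narrowing step could be sketched in Python like this: group existing shops by country first, then run the Levenshtein check on names only within the same country. The `candidate_pairs` helper, the dict keys, and the threshold of 2 are assumptions made for illustration:

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def candidate_pairs(imported, existing, max_distance=2):
    """Return (imported_shop, existing_shop, distance) for likely matches."""
    # Group existing shops by normalized country so each imported shop
    # is only compared against shops in the same country.
    by_country = defaultdict(list)
    for shop in existing:
        by_country[shop["country"].strip().lower()].append(shop)

    pairs = []
    for new_shop in imported:
        key = new_shop["country"].strip().lower()
        for shop in by_country.get(key, []):
            distance = levenshtein(new_shop["name"].lower(), shop["name"].lower())
            if distance <= max_distance:
                pairs.append((new_shop, shop, distance))
    return pairs
```

The resulting short list is what would then go through the more thorough name and address comparison.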

Steven asked Jul 28 '11 08:07


1 Answer

Levenshtein distance is the way to do it, and you can avoid manual input. The actual implementation, though, will depend on some prior knowledge about the data, such as how much error you expect in the spellings.

Suppose, for example, that the data is of good quality and you are only expecting typos. You can then build a matching condition based on: 1) whether the number of words is the same, 2) whether the words appear in the same sequence, and 3) a small threshold on the allowed Levenshtein distance for each word in the name.

The conditions can be reinforced by checking the address with a similar condition when there is ambiguity in the name, or vice versa.
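As a rough Python sketch of the word-by-word condition and the address check: the function names, the `name`/`street` keys, and the per-word threshold of 1 are assumptions. Requiring both fields to pass is only one possible reading of "reinforced":

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def words_match(a: str, b: str, max_word_distance: int = 1) -> bool:
    """True if both strings have the same number of words, in the same
    order, and each word pair is within max_word_distance edits."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    if len(words_a) != len(words_b):
        return False
    return all(levenshtein(x, y) <= max_word_distance
               for x, y in zip(words_a, words_b))


def is_duplicate(new_shop, shop, max_word_distance=1):
    """Treat a pair as a duplicate only if name AND street both pass."""
    return (words_match(new_shop["name"], shop["name"], max_word_distance)
            and words_match(new_shop["street"], shop["street"], max_word_distance))
```

Comparing word by word keeps a single typo from being confused with a reordered or longer name, which a whole-string distance with a loose threshold could let through.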

Shaunak answered Sep 28 '22 07:09