Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Normalizing a Table with Low Integrity

Tags:

database

I've been handed a table with about 18000 rows. Each record describes the location of one customer. The issue is, that when the person created the table, they did not add a field for "Company Name", only "Location Name," and one company can have many locations.

For example, here are some records that describe the same customer:

Location Table

 ID  Location_Name     
 1   TownShop#1        
 2   Town Shop - Loc 2 
 3   The Town Shop     
 4   TTS - Someplace   
 5   Town Shop,the 3   
 6   Toen Shop4        

My goal is to make it look like:

Location Table

 ID  Company_ID   Location_Name     
 1   1            Town Shop#1       
 2   1            Town Shop - Loc 2 
 3   1            The Town Shop     
 4   1            TTS - Someplace   
 5   1            Town Shop,the 3   
 6   1            Toen Shop4        

Company Table

 Company_ID  Company_Name  
 1           The Town Shop 

There is no "Company" table, I will have to generate the Company Name list from the most descriptive or best Location Name that represents the multiple locations.

Currently I am thinking I need to generate a list of Location Names that are similar, and then and go through that list by hand.

Any suggestions on how I can approach this is appreciated.

@Neall, Thank you for your statement, but unfortunately, each location name is distinct, there are no duplicate location names, only similar. So in the results from your statement "repcount" is 1 in each row.

@yukondude, Your step 4 is the heart of my question.

like image 834
Joe Mako Avatar asked Aug 08 '08 16:08

Joe Mako


1 Answers

Please update the question, do you have a list of CompanyNames available to you? I ask because you maybe able to use Levenshtein algo to find a relationship between your list of CompanyNames and LocationNames.


Update

There is not a list of Company Names, I will have to generate the company name from the most descriptive or best Location Name that represents the multiple locations.

Okay... try this:

  1. Build a list of candidate CompanyNames by finding LocationNames made up of mostly or all alphabetic characters. You can use regular expressions for this. Store this list in a separate table.
  2. Sort that list alphabetically and (manually) determine which entries should be CompanyNames.
  3. Compare each CompanyName to each LocationName and come up with a match score (use Levenshtein or some other string matching algo). Store the result in a separate table.
  4. Set a threshold score such that any MatchScore < Threshold will not be considered a match for a given CompanyName.
  5. Manually vet through the LocationNames by CompanyName | LocationName | MatchScore, and figure out which ones actually match. Ordering by MatchScore should make the process less painful.

The whole purpose of the above actions is to automate parts and limit the scope of your problem. It's far from perfect, but will hopefully save you the trouble of going through 18K records by hand.

like image 79
Jake McGraw Avatar answered Nov 15 '22 04:11

Jake McGraw