Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Record Matching algorithms for an inconsistent dataset

I'm working with a large dataset of products(~1 million). These products come from many different sources and thus the way they have data listed in inconsistent. One of the big issues is variances product Brand names (~17,000 unique brands). Some brands have as many as 10 variances that need to be related together.

Issues:


  1. Inconsistant Spacing: Jet Boil VS Jetboil
  2. Punctuation: Granger's VS Grangers
  3. Noise Words: The North Face VS North Face
  4. Taxomonies: Armada VS Armada Skis
  5. Symbols: Phil and Teds VS Phil&Teds
  6. Mis-spelling: Patagonia VS Pategonia
  7. Other Oddities: Bell Sports VS Bell Sports #81037

Example Dataset


Black Diamond
Black Diamond (Uda)
Black Diamond Co
Black Diamond Eq Ltd
Black Diamond Eqp #76800
Black Diamond Equipment
Black Dog Machine Llc
Black Dome Press
Black Dot
Black Dragon
Black Fire
Black Flys
Black Forest Girl
Black Gold
Black Hawk Inc.
Black Hills
Black Knight
Black Label
Black Magic
Black Marine
Black Market Bikes
Black Max
Black Opal
Black Ops
Black Rain Ordance Inc.
Black Rain Ordnance
Black Rapid
Black Ribbon
Black Rifle Disease Engineerin
Black River Bucks
Black Seal
Black Seed
Black Swan
Black Tower
Black Widow
Black's

Consequences (as suggested in a comment)

  • An incorrect association will result in unrelated brands being displayed in product searches and thus weaken the usability of the presentation layer
  • Missing an association will result in the same brand being displayed multiple in a filter list and thus weaken the usability of the presentation layer

I realize that is is a large problem and likely beyond the scope of what can be resolved in a stack overflow article, but I'm looking for inspirations on how to tackle this problem.

Any algorithm, software pattern, or process that may help is welcome.

like image 941
NSjonas Avatar asked Sep 07 '12 20:09

NSjonas


1 Answers

Well, the way I would approach this would be to use some distance metric to quantify similarity between phrases and then cluster the terms by their distances.

You can start with a classical text metric like Levenshtein distance (you will find many implementations easily), which is basically the edit distance, or the number of operations you need to get from one string to another string, where an operation can be a substitution, insertion or deletion.

From the examples you gave, it seems that Levenshtein would be reasonable.

For clustering there are tons of algorithms, again this is easy to google and find tons of implementations. Clustering basically finds groups (clusters) of objects that are close to each other under a certain distance metric. In your case these would be groups of terms that are similar to each other.

Once you see the results, you can try playing with your distance metric a little bit by making manual adjustments using your knowledge about the data (like specifying that "&" is close to "and", etc).

Good luck!

like image 62
Bitwise Avatar answered Nov 15 '22 09:11

Bitwise