Record Matching algorithms for an inconsistent dataset

Question

I'm working with a large dataset of products(~1 million). These products come from many different sources and thus the way they have data listed in inconsistent. One of the big issues is variances product Brand names (~17,000 unique brands). Some brands have as many as 10 variances that need to be related together.

Issues:

Inconsistant Spacing: Jet Boil VS Jetboil
Punctuation: Granger's VS Grangers
Noise Words: The North Face VS North Face
Taxomonies: Armada VS Armada Skis
Symbols: Phil and Teds VS Phil&Teds
Mis-spelling: Patagonia VS Pategonia
Other Oddities: Bell Sports VS Bell Sports #81037

Example Dataset

Black Diamond
Black Diamond (Uda)
Black Diamond Co
Black Diamond Eq Ltd
Black Diamond Eqp #76800
Black Diamond Equipment
Black Dog Machine Llc
Black Dome Press
Black Dot
Black Dragon
Black Fire
Black Flys
Black Forest Girl
Black Gold
Black Hawk Inc.
Black Hills
Black Knight
Black Label
Black Magic
Black Marine
Black Market Bikes
Black Max
Black Opal
Black Ops
Black Rain Ordance Inc.
Black Rain Ordnance
Black Rapid
Black Ribbon
Black Rifle Disease Engineerin
Black River Bucks
Black Seal
Black Seed
Black Swan
Black Tower
Black Widow
Black's

Consequences (as suggested in a comment)

An incorrect association will result in unrelated brands being displayed in product searches and thus weaken the usability of the presentation layer
Missing an association will result in the same brand being displayed multiple in a filter list and thus weaken the usability of the presentation layer

I realize that is is a large problem and likely beyond the scope of what can be resolved in a stack overflow article, but I'm looking for inspirations on how to tackle this problem.

Any algorithm, software pattern, or process that may help is welcome.

Bitwise · Accepted Answer

Well, the way I would approach this would be to use some distance metric to quantify similarity between phrases and then cluster the terms by their distances.

You can start with a classical text metric like Levenshtein distance (you will find many implementations easily), which is basically the edit distance, or the number of operations you need to get from one string to another string, where an operation can be a substitution, insertion or deletion.

From the examples you gave, it seems that Levenshtein would be reasonable.

For clustering there are tons of algorithms, again this is easy to google and find tons of implementations. Clustering basically finds groups (clusters) of objects that are close to each other under a certain distance metric. In your case these would be groups of terms that are similar to each other.

Once you see the results, you can try playing with your distance metric a little bit by making manual adjustments using your knowledge about the data (like specifying that "&" is close to "and", etc).

Good luck!

Record Matching algorithms for an inconsistent dataset

Tags:

string-matching

algorithm

pattern-matching

NSjonas

1 Answers

Bitwise

Recent Activity

Donate For Us

Record Matching algorithms for an inconsistent dataset

Tags:

string-matching

algorithm

pattern-matching

NSjonas

1 Answers

Bitwise

Related questions

Recent Activity

Donate For Us