I'm working with a large dataset of products (~1 million). These products come from many different sources, so the way their data is listed is inconsistent. One of the big issues is variation in product brand names (~17,000 unique brands). Some brands have as many as 10 variants that need to be related together.
Issues:
Example Dataset
Black Diamond
Black Diamond (Uda)
Black Diamond Co
Black Diamond Eq Ltd
Black Diamond Eqp #76800
Black Diamond Equipment
Black Dog Machine Llc
Black Dome Press
Black Dot
Black Dragon
Black Fire
Black Flys
Black Forest Girl
Black Gold
Black Hawk Inc.
Black Hills
Black Knight
Black Label
Black Magic
Black Marine
Black Market Bikes
Black Max
Black Opal
Black Ops
Black Rain Ordance Inc.
Black Rain Ordnance
Black Rapid
Black Ribbon
Black Rifle Disease Engineerin
Black River Bucks
Black Seal
Black Seed
Black Swan
Black Tower
Black Widow
Black's
Consequences (as suggested in a comment)
I realize that this is a large problem and likely beyond the scope of what can be resolved in a Stack Overflow question, but I'm looking for inspiration on how to tackle it.
Any algorithm, software pattern, or process that may help is welcome.
Well, the way I would approach this would be to use some distance metric to quantify similarity between phrases and then cluster the terms by their distances.
You can start with a classical text metric like Levenshtein distance (you will find many implementations easily). This is the edit distance: the minimum number of single-character operations needed to turn one string into another, where an operation is a substitution, an insertion, or a deletion.
From the examples you gave, it seems that Levenshtein would be reasonable.
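To make the edit-distance idea concrete, here is a minimal sketch in plain Python using the standard dynamic-programming approach. The function names `levenshtein` and `normalized_distance` are my own, and dividing by the longer string's length is just one common way to make distances comparable across names of different lengths; treat both as assumptions rather than a recommendation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # drop char_a
                current[j - 1] + 1,      # add char_b
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to the range 0..1 (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For example, `levenshtein("Black Diamond", "Black Diamond Co")` is 3, because three characters (a space, "C", and "o") have to be inserted.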
For clustering there are many algorithms, and implementations are easy to find. Clustering finds groups (clusters) of objects that are close to each other under a given distance metric. In your case these would be groups of brand names that are similar to each other.
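As one possible illustration of clustering by distance, the sketch below links any two names whose normalized edit distance is at or below a threshold and then returns the connected groups (a simple form of single-linkage clustering). The function name `cluster_names` and the threshold of 0.3 are hypothetical choices for this example, not tuned values, and many other clustering algorithms would work as well.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cluster_names(names, threshold=0.3):
    """Group names whose normalized distance is <= threshold.

    Names are linked transitively, so a chain of near matches ends up
    in one cluster. The threshold is an assumption to tune on real data.
    """
    parent = list(range(len(names)))  # union-find forest over indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if normalized_distance(names[i], names[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


brands = [
    "Black Diamond",
    "Black Diamond Co",
    "Black Rain Ordance Inc.",
    "Black Rain Ordnance",
    "Black Swan",
]
clusters = cluster_names(brands)
```

With these five names, the two "Black Diamond" spellings end up together, the two "Black Rain" spellings end up together, and "Black Swan" stays alone. Note that this compares every pair, which for ~17,000 brands is roughly 144 million comparisons; in practice you would probably want to compare only names that share a first word or some other cheap key.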
Once you see the results, you can try playing with your distance metric a little bit by making manual adjustments using your knowledge about the data (like specifying that "&" is close to "and", etc).
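One way to encode that kind of domain knowledge is to normalize each name before measuring distances, so that known equivalences cost nothing. The sketch below is only an illustration: the function name `normalize_brand`, the `NOISE_TOKENS` list, and the rules for parentheses and `#` codes are all assumptions based on the sample data, and you would adjust them after looking at your own clusters.

```python
import re

# Hypothetical list of tokens that rarely distinguish one brand from
# another; extend or shrink it based on what your data shows.
NOISE_TOKENS = {"inc", "llc", "ltd", "co", "corp"}


def normalize_brand(name: str) -> str:
    """Return a simplified form of a brand name for comparison only."""
    text = name.casefold()
    text = text.replace("&", " and ")     # treat "&" and "and" as the same
    text = re.sub(r"\(.*?\)", " ", text)  # drop notes such as "(Uda)"
    text = re.sub(r"#\d+", " ", text)     # drop codes such as "#76800"
    text = re.sub(r"[^\w\s]", "", text)   # strip remaining punctuation
    tokens = [t for t in text.split() if t not in NOISE_TOKENS]
    return " ".join(tokens)
```

With these rules, "Black Diamond (Uda)" and "Black Diamond Co" both become "black diamond", so they match exactly before any distance is computed. Keep the original name alongside the normalized one, since the normalized form is only meant as a comparison key.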
Good luck!