Comparing and matching product names from different stores/suppliers

I’m trying to write a simple program to compare prices for products from different suppliers. Different suppliers may call the same product different things.

For example, the following three strings refer to the same product:

A2 Full Cream Milk Bottle 2l
A2 Milk Full Cream 2L
A2 Full Cream Milk 2L

Similarly, the following two strings refer to the same product:

Ambi Pur Air Freshener Car Voyage 8mL. Fresh Vanilla Flower fragrance. - 1 each
Ambi Pur Air Freshener Voyage Primary 8ml

Furthermore, some products are not the same but are similar (for example, "Full Cream 2L Milk" may encompass various similar products).

The only pieces of information I have on each product are the title and a price.

What are currently recommended techniques for matching product strings like this?

From my Googling and reading other SO threads, I found:

Some people recommend using Bayesian filtering techniques.
Some recommend doing feature extraction on all the product strings. So you might extract things like brand (e.g. “A2”), product (“Milk”) and capacity (“2L”) from each title, then create distance vectors between products, and use something like a binary classifier to match products (SVM was mentioned). However, I’m not sure how to achieve this without a whole bunch of rules or regexes. I’m assuming there are probably smarter unsupervised learning methods for attacking this problem? Price could probably be another “feature” to use when calculating the distance vector as well.
Some people recommended neural-network approaches; however, I wasn't able to find much in the way of concrete code or examples.
Others recommended using string similarity algorithms, such as Levenshtein distance, or the Jaro-Winkler distance.
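
For illustration, the string-similarity option could be sketched roughly as follows. This is a minimal sketch, not a recommended solution: the `normalize` step (lowercasing and sorting tokens so word order is ignored) and the `similarity` scaling are assumptions added here, and only the Levenshtein distance itself comes from the techniques listed above.

```python
import re

def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order does not matter.
    (This normalization is an assumption made for the sketch.)"""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))

def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Scale the distance between normalized names into the range [0, 1]."""
    na, nb = normalize(a), normalize(b)
    longest = max(len(na), len(nb)) or 1
    return 1.0 - levenshtein(na, nb) / longest

if __name__ == "__main__":
    print(similarity("A2 Milk Full Cream 2L", "A2 Full Cream Milk 2L"))
    print(similarity("A2 Full Cream Milk Bottle 2l", "A2 Full Cream Milk 2L"))
```

Sorting the tokens makes "A2 Milk Full Cream 2L" and "A2 Full Cream Milk 2L" identical after normalization, which plain edit distance on the raw strings would not do. Comparing every pair this way is quadratic in the number of products, so it only suits small catalogs.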

Would you use one of the above techniques, or would you use a different technique?

Also, does anybody know of any example code, or even libraries for this sort of problem? I couldn't seem to find any.

(For example, I saw that some people were having performance problems when calculating the Jaro-Winkler distance for large datasets. I was hoping there might be a distributed implementation of the algorithm (e.g. with Mahout), but wasn’t able to find anything concrete.)

asked Nov 04 '13 14:11

victorhooi

1 Answer

Would you use one of the above techniques, or would you use a different technique?

If I were doing this for real, I wouldn't use much machine learning. I'm sure most big companies have a database of brand and product names, and use that to match things up fairly easily. Some data cleaning might be needed, but it's not much of an ML problem.

If you don't have that database, I'd say go simple. Convert everything to a feature vector and do a nearest-neighbor search. Use that to build a tool that helps you create the database. For example, you mark the first "A2 Whole Milk 2L" as "milk" yourself, and then check whether its nearest neighbors are also milk. Give yourself a way to quickly mark each one "yes" or "needs review", or some similar set of options.

For simple data such as you suggested, where it will work 90% of the time, you should be able to get through the data with ease. I've done something similar to label several thousand documents in a day.
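
A rough sketch of the feature-vector and nearest-neighbor idea, under some assumptions: the bag-of-tokens features, the cosine similarity measure, and the names `featurize`, `cosine`, and `nearest_neighbors` are all illustrative choices, not something specified above. The search is brute force, which should be fine for a few thousand titles.

```python
import math
import re
from collections import Counter

def featurize(title: str) -> Counter:
    """Bag-of-tokens feature vector: lowercase alphanumeric tokens with counts.
    (An assumed featurization; character n-grams would be another option.)"""
    return Counter(re.findall(r"[a-z0-9]+", title.lower()))

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def nearest_neighbors(query: str, titles: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k titles most similar to the query, best match first."""
    q = featurize(query)
    scored = [(cosine(q, featurize(t)), t) for t in titles]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

if __name__ == "__main__":
    catalog = [
        "A2 Milk Full Cream 2L",
        "A2 Full Cream Milk 2L",
        "Ambi Pur Air Freshener Voyage Primary 8ml",
    ]
    # Show the candidates a reviewer would mark "yes" or "needs review".
    for score, title in nearest_neighbors("A2 Full Cream Milk Bottle 2l", catalog, k=2):
        print(f"{score:.2f}  {title}")
```

In a labeling tool, the printed candidates would be what you confirm or flag for review; the score only orders them and is not a decision threshold by itself.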

Once you have your own database, resolving these should be pretty straightforward. You could reuse the code to create your database to handle "unseen" data.
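
As a hedged sketch of what resolving against your own database might look like: the `ProductDatabase` class, the labels, and the 0.8 cutoff are hypothetical, and the fuzzy fallback uses Python's standard `difflib.get_close_matches` as a stand-in for whatever nearest-neighbor code you used to build the database.

```python
import difflib
import re

def normalize(title: str) -> str:
    """Lowercase and sort tokens so word order does not matter (an assumption)."""
    return " ".join(sorted(re.findall(r"[a-z0-9]+", title.lower())))

class ProductDatabase:
    """Hypothetical lookup table from normalized supplier titles to product labels."""

    def __init__(self):
        self._labels = {}  # normalized title -> product label

    def add(self, title: str, label: str) -> None:
        self._labels[normalize(title)] = label

    def resolve(self, title: str, cutoff: float = 0.8):
        """Return (label, status): exact hit, fuzzy hit for review, or unknown."""
        key = normalize(title)
        if key in self._labels:
            return self._labels[key], "exact"
        # Unseen title: fall back to the closest known title, if close enough.
        close = difflib.get_close_matches(key, list(self._labels), n=1, cutoff=cutoff)
        if close:
            return self._labels[close[0]], "needs review"
        return None, "unknown"

if __name__ == "__main__":
    db = ProductDatabase()
    db.add("A2 Full Cream Milk 2L", "a2-full-cream-milk-2l")
    print(db.resolve("A2 Milk Full Cream 2L"))
    print(db.resolve("A2 Full Cream Milk Bottle 2l"))
    print(db.resolve("Ambi Pur Air Freshener Voyage Primary 8ml"))
```

Fuzzy hits are returned as "needs review" rather than accepted automatically, so confirmed matches can be added back to the database and become exact hits next time.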

answered Oct 27 '22 13:10

Raff.Edward

Tags: algorithm, machine-learning, nlp
