I’m trying to write a simple program to compare prices for products from different suppliers. Different suppliers may call the same product different things.
For example, the following three strings refer to the same product:
Or the following two strings are the same product:
Furthermore, some products are not the same but are similar (for example, Full Cream 2L Milk may encompass various similar products).
The only information I have on each product is a title and a price.
What are currently recommended techniques for matching product strings like this?
From my Googling and reading other SO threads, I found:
Would you use one of the above techniques, or would you use a different technique?
Also, does anybody know of any example code, or even libraries for this sort of problem? I couldn't seem to find any.
(For example, I saw that some people were having performance problems with calculating the Jaro-Winkler distance for large data-sets. I was hoping there might be a distributed implementation of the algorithm (e.g. with Mahout), but wasn’t able to find anything concrete.)
Product matching is a form of data analysis that matters a great deal in eCommerce, both internally and externally. Internally, product matching is used for database cleansing: duplicates are identified and eliminated in the product master data of both online retailers and brand-name manufacturers.
Benefits of product matching: it helps retailers organize listings on a marketplace platform, discover gaps and missing information or attributes in the product catalog, and consolidate varying product data from multiple sources into a unified source.
Machine-learning matching uses natural language processing and algorithmic probability. The system reads and analyses the full user input, and the matching strength depends on the confidence score the user sets. ML is the default matching system and is enabled automatically.
Would you use one of the above techniques, or would you use a different technique?
If I were doing this for real, I wouldn't use much machine learning. I'm sure most big companies have a database of brand and product names and use that to match things up fairly easily. Some data sanitization might be needed, but it's not much of an ML problem.
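To make the "database plus some sanitization" idea concrete, here is a minimal sketch, assuming the database is just a mapping from product titles to a label. The names `sanitize`, `lookup`, and `REFERENCE`, along with the sample entries, are made up for illustration and are not taken from any real catalog.

```python
import re

def sanitize(title):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(sorted(tokens))

# Hypothetical reference database: sanitized title -> canonical label.
REFERENCE = {
    sanitize(name): label
    for name, label in [
        ("A2 Whole Milk 2L", "milk"),
        ("Full Cream 2L Milk", "milk"),
    ]
}

def lookup(title):
    """Return the canonical label for a title, or None if it is unknown."""
    return REFERENCE.get(sanitize(title))

print(lookup("Milk, Full Cream - 2L"))
print(lookup("Orange Juice 1L"))
```

Exact lookup after sanitization only catches reordering, casing, and punctuation differences; anything further apart (abbreviations, "2 Litre" vs "2L") still needs a fuzzier comparison.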
If you don't have that database, I'd say go simple. Convert every title to a feature vector and do a nearest-neighbor search, then use that to build a tool that helps you create the database. For example, you mark the first "A2 Whole Milk 2L" as "milk" yourself, and then check whether its nearest neighbors are also milk. Give yourself a way to quickly mark each one "yes" or "needs review", or some similar set of options.
For simple data like yours, where this will work 90% of the time, you should be able to get through the data with ease. I've done something similar to label several thousand documents in a day.
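One possible way to sketch the feature-vector and nearest-neighbor step, using only the standard library. The choice of character trigrams and cosine similarity is my assumption (the answer doesn't prescribe a particular featurization), and the function names and sample titles are hypothetical.

```python
import math
from collections import Counter

def featurize(title, n=3):
    """Turn a title into a sparse vector of character n-gram counts."""
    text = " " + " ".join(title.lower().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbors(query, titles, k=3):
    """Return the k titles most similar to the query, best first."""
    q = featurize(query)
    scored = [(cosine(q, featurize(t)), t) for t in titles]
    return sorted(scored, reverse=True)[:k]

# Hypothetical supplier titles to review against one hand-labeled seed.
titles = [
    "Full Cream 2L Milk",
    "Whole Milk A2 2 Litre",
    "Orange Juice 1L",
    "Dish Soap 500ml",
]
for score, title in nearest_neighbors("A2 Whole Milk 2L", titles, k=2):
    print(f"{score:.2f}  {title}")
```

This brute-force scan compares the query with every title, which is fine for a few thousand products; for much larger sets you would want an index or a library built for approximate nearest-neighbor search.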
Once you have your own database, resolving these should be pretty straightforward. You could reuse the code to create your database to handle "unseen" data.
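A rough sketch of handling "unseen" titles once a labeled database exists. To keep it short and self-contained it uses `difflib.get_close_matches` from the standard library as the similarity measure, not a feature-vector search; the `LABELED` entries, the `resolve` name, and the 0.8 cutoff are all assumptions for illustration.

```python
import difflib

# Hypothetical hand-labeled database built during the review pass.
LABELED = {
    "a2 whole milk 2l": "milk",
    "full cream 2l milk": "milk",
    "orange juice 1l": "juice",
}

def resolve(title, cutoff=0.8):
    """Return (label, matched_title), or (None, None) if it needs review."""
    key = " ".join(title.lower().split())
    matches = difflib.get_close_matches(key, list(LABELED), n=1, cutoff=cutoff)
    if matches:
        return LABELED[matches[0]], matches[0]
    return None, None  # nothing close enough: send to manual review

print(resolve("A2 Whole Milk 2 L"))
print(resolve("Dish Soap 500ml"))
```

Titles that fall below the cutoff go back into the same "yes" / "needs review" loop, so the database keeps growing as new suppliers are added.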