Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best machine learning technique for matching product strings

Here's a puzzle...

I have two databases of the same 50000+ electronic products and I want to match products in one database to those in the other. However, the product names are not always identical. I've tried using the Levenshtein distance for measuring the string similarity however this hasn't worked. For example,

-LG 42CS560 42-Inch 1080p 60Hz LCD HDTV -LG 42 Inch 1080p LCD HDTV 

These items are the same, yet their product names vary quite a lot.

On the other hand...

-LG 42 Inch 1080p LCD HDTV -LG 50 Inch 1080p LCD HDTV 

These are different products with very similar product names.

How should I tackle this problem?

like image 274
Nathan Avatar asked Aug 16 '12 02:08

Nathan


People also ask

Which is the best string matching algorithm?

The Karp-Rabin Algorithm.

Can machine learning work with strings?

You cannot use strings as inputs to ML or DL models, you have to encode the strings into vectors of numbers using different techniques available and then use them as inputs to your model.

Which algorithm is used to match two strings?

Levenshtein distance is the most frequently used algorithm. It was founded by the Russian scientist, Vladimir Levenshtein to calculate the similarities between two strings. This is also known as the Edit distance-based algorithm as it computes the number of edits required to transform one string to another.

What is fuzzy Wuzzy algorithm?

FuzzyWuzzy is a library of Python which is used for string matching. Fuzzy string matching is the process of finding strings that match a given pattern. Basically it uses Levenshtein Distance to calculate the differences between sequences.


2 Answers

My first thought is to try to parse the names into a description of features (company LG, size 42 Inch, resolution 1080p, type LCD HDTV). Then you can match these descriptions against each other for compatibility; it's okay to omit a product number but bad to have different sizes. Simple are-the-common-attributes-compatible might be enough, or you might have to write / learn rules about how much different attributes are allowed to differ and so on.

Depending on how many different kinds of products you have and how different the listed names are, I might actually start by manually defining a set of attributes and possibly even just adding specific words / regexes to match them, iteratively seeing what isn't been parsed so far and adding rules for that. I'd imagine there's not a lot of ambiguity in terms of one vocabulary item possibly belonging to multiple attributes, though without seeing your database I guess I don't know.

If that's not going to be feasible, this extraction is kind of analogous to semi-supervised part-of-speech tagging. It's somewhat different, though, in that I imagine the vocabulary is much more limited than typical parsing, and in that the space of product names is more heirarchical: the resolution tag only applies to certain kinds of products. I'm not very familiar with that literature; there might be some ideas you could use.

like image 71
Danica Avatar answered Sep 21 '22 18:09

Danica


Use a large set of training examples. For each possible pair in this example set:

  1. Parse the string for its components, viz. company, size_desc, display_type, make and so on.
  2. Find the distance between the same components between the two strings of a pair.
  3. Create a tuple of numbers representing the distance between the components.
  4. Label the tuple as identical/non-identical based on the strings in the pair as part of the training set.
  5. Feed the tuples and train a binary classifier (SVM).

Now, when you get a pair of strings for which you want to decide if they are same or not, extract the features like you did in the training set and create the tuple of numbers for the distance between the various components of the string. Feed the tuple to the trained SVM and classify if they are same or not.

The advantage of using a learning approach like this is that you don't have to keep modifying the rules over and over again, and also the system learns the differences between a large pair of products that are same and different.

You could use LibSVM package in WEKA to do this.

like image 29
London guy Avatar answered Sep 22 '22 18:09

London guy