
Algorithm to classify a list of products? [closed]

Tags: algorithm, nlp

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.

  1. Seagate Hard Drive 500Go
  2. Seagate Hard Drive 120Go for laptop
  3. Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
  4. New and shinny 500Go hard drive from Seagate
  5. Seagate Barracuda 7200.12
  6. Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail

To a human, hard drives 3 and 5 are the same product. We could go a little further and treat products 1, 3, 4 and 5 as the same, and put products 2 and 6 in other categories.

We have a huge list of products that I would like to classify. Does anybody have an idea of what the best algorithm for this would be? Any suggestions?

I thought of a Bayesian classifier, but I am not sure it is the best choice. Any help would be appreciated!

Thanks.

Martin asked Mar 29 '09 20:03



2 Answers

You need at least two components:

First, you need something that does "feature" extraction, i.e. that takes your items and extracts the relevant information. For example, "new and shinny" is not as relevant as "500Go hard drive" and "seagate". A (very) simple approach would be a heuristic that extracts manufacturers, technology names like "USB2.0", and patterns like "GB" or "RPM" from each item.

You then end up with a set of features for each item. Some machine learning people like to put this into a "feature vector", i.e. a vector with one entry per feature, set to 0 or 1 depending on whether the item has that feature. This is your data representation. On these vectors you can then do a distance comparison.
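
As a rough illustration of the two steps above, here is a minimal Python sketch. The vocabulary (`MANUFACTURERS`), the regular expressions and the feature names are all assumptions made up for this example, and Jaccard distance is just one possible distance on 0/1 vectors; the answer does not prescribe a specific one.

```python
import re

# Hypothetical, hand-picked vocabulary; a real list would be much larger.
MANUFACTURERS = {"seagate", "maxtor", "quantum"}

def extract_features(title):
    """Return a set of normalized features found in a product title."""
    text = title.lower()
    features = set()
    for word in re.findall(r"[a-z]+", text):
        if word in MANUFACTURERS:
            features.add("brand:" + word)
    # Capacity such as "500Go" or "500GB" ("Go" is the French spelling of GB).
    # The lookarounds skip transfer rates such as "3.0Gb/s".
    for number, _unit in re.findall(r"(?<![\d.])(\d+)\s*(gb|go)\b(?!/)", text):
        features.add("capacity_gb:" + number)
    # Rotation speed such as "7200 RPM" or "7200RPM".
    for number in re.findall(r"(\d+)\s*rpm\b", text):
        features.add("rpm:" + number)
    # Interface names such as "USB2.0" or "SATA".
    for name in re.findall(r"\b(usb|sata)", text):
        features.add("interface:" + name)
    return features

def to_vector(features, vocabulary):
    """Binary feature vector: one 0/1 entry per feature in the vocabulary."""
    return [1 if f in features else 0 for f in vocabulary]

def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the positions set to 1."""
    union = sum(1 for x, y in zip(a, b) if x or y)
    if union == 0:
        return 0.0
    shared = sum(1 for x, y in zip(a, b) if x and y)
    return 1.0 - shared / union
```

With these assumed rules, items 1 and 4 from the question yield the same feature set (brand and capacity), so their distance is zero, while item 2 differs in capacity and ends up further away.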

Note that you might end up with vectors of thousands of entries. Even then, you still have to cluster your results.
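
One simple way to do that clustering step is to group items whose feature sets are within some distance of each other. The sketch below uses single-linkage grouping with a union-find structure; the threshold `max_distance` and the choice of Jaccard distance on feature sets are assumptions for illustration, not something the answer specifies.

```python
def jaccard_distance(a, b):
    """Distance between two feature sets: 0.0 = identical, 1.0 = disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cluster(feature_sets, max_distance):
    """Single-linkage grouping: items within max_distance share a cluster.

    Returns a sorted list of clusters, each a list of item indices.
    """
    parent = list(range(len(feature_sets)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(feature_sets)):
        for j in range(i + 1, len(feature_sets)):
            if jaccard_distance(feature_sets[i], feature_sets[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(feature_sets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

This compares every pair of items, so the cost grows quadratically with the list size. For a truly huge list you would probably need an index or an approximate nearest-neighbour search to avoid the full pairwise comparison.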

Possibly useful Wikipedia articles:

  • Feature Extraction
  • Nearest Neighbour Search
Manuel answered Oct 18 '22 23:10



To actually classify a product, you could use something like an "enhanced neural network" with a blackboard. (This is just a metaphor to get you thinking in the right direction, not a strict use of the terms.)

Imagine a set of objects that are connected through listeners or events (just like neurons and synapses). Each object has a set of patterns and tests the input against these patterns.

An example:

  • One object tests for ("seagate"|"connor"|"maxtor"|"quantum"| ...)
  • Another object tests for [[:digit:]]+(" ")?("gb"|"mb")
  • Another object tests for [[:digit:]]+(" ")?"rpm"

All these objects connect to another object that, if certain combinations of them fire, categorizes the input as a hard drive. The individual objects themselves would write certain characterizations, such as manufacturer, capacity, or speed, onto the blackboard (a common writing area for recording facts about the input).

So the neurons do not fire based on a threshold, but on recognition of a pattern. Many of these neurons can work on the blackboard in parallel and even correct categorizations made by other neurons (maybe by introducing certainties?).
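
A very small sketch of this idea in Python might look like the following. The recognizer list, the slot names and the combination rule are hypothetical, the blackboard is reduced to a plain dictionary, and everything runs sequentially rather than in parallel; it only illustrates the flow, not the prototype described here.

```python
import re

# Each recognizer is a (slot name, compiled pattern) pair; patterns are illustrative.
RECOGNIZERS = [
    ("manufacturer", re.compile(r"\b(seagate|connor|maxtor|quantum)\b")),
    # The lookarounds skip transfer rates such as "3.0gb/s".
    ("capacity", re.compile(r"(?<![\d.])(\d+)\s?(gb|mb)\b(?!/)")),
    ("speed", re.compile(r"(\d+)\s?rpm\b")),
]

def classify(title):
    """Run every recognizer and collect what fired on a shared blackboard."""
    blackboard = {}
    text = title.lower()
    for slot, pattern in RECOGNIZERS:
        match = pattern.search(text)
        if match:
            # The recognizer "fires" and writes its finding to the blackboard.
            blackboard[slot] = match.group(0)
    # Hypothetical combination rule: a known manufacturer plus a capacity
    # or a rotation speed is taken to indicate a hard drive.
    if "manufacturer" in blackboard and ("capacity" in blackboard or "speed" in blackboard):
        blackboard["category"] = "hard drive"
    return blackboard
```

For item 6 from the question, all three recognizers fire and the combination rule adds the category; a title that matches none of the patterns leaves the blackboard empty.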

I used something like this in a prototype for a product used to classify products according to UNSPSC and was able to get 97% correct classification on car parts.

Ralph M. Rickenbach answered Oct 19 '22 00:10
