Algorithm to classify a list of products? [closed]

Question

I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.

1. Seagate Hard Drive 500Go
2. Seagate Hard Drive 120Go for laptop
3. Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
4. New and shinny 500Go hard drive from Seagate
5. Seagate Barracuda 7200.12
6. Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail

To a human, hard drives 3 and 5 are the same product. We could go a little further and treat products 1, 3, 4 and 5 as the same, and put products 2 and 6 in other categories.

We have a huge list of products that I would like to classify. Does anybody know what the best algorithm for this would be? Any suggestions?

I thought of a Bayesian classifier, but I am not sure it is the best choice. Any help would be appreciated!

Thanks.

Manuel · Accepted Answer

You need at least two components:

First, you need something that does "feature" extraction, i.e. that takes your items and extracts the relevant information. For example, "new and shinny" is not as relevant as "500Go hard drive" and "seagate". A (very) simple approach would be a heuristic that extracts manufacturers, technology names like "USB2.0", and patterns like "GB" or "RPM" from each item.
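A minimal sketch of such a heuristic, assuming a small hand-written manufacturer list and a few regular expressions. The names `MANUFACTURERS` and `extract_features` and the `brand:`/`capacity:` labels are made up for illustration, and the patterns only cover the sample titles from the question (including the French "Go" for "GB"):

```python
import re

# Hypothetical manufacturer list; a real catalogue would need a far larger one.
MANUFACTURERS = {"seagate", "maxtor", "quantum"}

# Illustrative patterns only. The lookbehind/lookahead keep transfer rates
# such as "3.0Gb/s" from being read as a capacity.
CAPACITY = re.compile(r"(?<![\d.])(\d+)\s*(gb|go|tb|to|mb)\b(?!/s)")
SPEED = re.compile(r"(\d+)\s*rpm\b")
INTERFACE = re.compile(r"\b(usb\s?\d(?:\.\d)?|sata)\b")

def extract_features(title):
    """Return a set of normalised feature strings found in a product title."""
    lowered = title.lower()
    features = set()
    for word in re.findall(r"[a-z]+", lowered):
        if word in MANUFACTURERS:
            features.add("brand:" + word)
    for amount, unit in CAPACITY.findall(lowered):
        unit = unit.replace("o", "b")  # normalise French "go"/"to" to "gb"/"tb"
        features.add("capacity:" + amount + unit)
    for rpm in SPEED.findall(lowered):
        features.add("rpm:" + rpm)
    for iface in INTERFACE.findall(lowered):
        features.add("interface:" + iface.replace(" ", ""))
    return features
```

With these patterns, "New and shinny 500Go hard drive from Seagate" reduces to a brand and a capacity, and the marketing words are simply dropped. Anything the patterns do not anticipate (model numbers, "for laptop", "External") is lost, which is the main weakness of a purely hand-written heuristic.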

You then end up with a set of features for each item. Some machine learning people like to put this into a "feature vector", i.e. a vector with one entry per feature, set to 0 or 1 depending on whether the item has that feature. This is your data representation. On these vectors you can then do a distance comparison.
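To make the representation concrete, here is a small sketch that turns feature sets into 0/1 vectors and compares them. The feature names and the three example items are invented for illustration, and Hamming and Jaccard are just two common choices of distance, not the only reasonable ones:

```python
def build_vocabulary(feature_sets):
    """Fix an order for all features seen, so every vector has the same layout."""
    return sorted(set().union(*feature_sets))

def to_vector(features, vocabulary):
    """One entry per known feature: 1 if the item has it, 0 otherwise."""
    return [1 if name in features else 0 for name in vocabulary]

def hamming_distance(a, b):
    """Number of positions where the two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 minus (shared features / features present in either vector)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0

# Hypothetical feature sets for three product titles.
items = [
    {"brand:seagate", "capacity:500gb"},
    {"brand:seagate", "capacity:120gb", "form:laptop"},
    {"brand:seagate", "capacity:500gb", "rpm:7200"},
]
vocabulary = build_vocabulary(items)
vectors = [to_vector(item, vocabulary) for item in items]

print(hamming_distance(vectors[0], vectors[2]))  # items 0 and 2 differ in one feature
print(hamming_distance(vectors[0], vectors[1]))  # items 0 and 1 differ in three
```

Jaccard distance ignores features that both items lack, which tends to matter once the vocabulary grows to thousands of entries and most positions are 0 for any given item.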

Note that you might end up with a vector of thousands of entries. Even then, you still have to cluster your results.
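As a rough illustration of the clustering step, the sketch below groups items whose feature sets are within a chosen Jaccard distance of each other (single-linkage grouping via union-find). This is one simple technique picked for the example, not necessarily the best one; the function names, the sample feature sets, and the `max_distance=0.4` threshold are all assumptions:

```python
def jaccard_distance(a, b):
    """Distance between two feature sets: 0.0 if identical, 1.0 if disjoint."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def cluster(feature_sets, max_distance=0.4):
    """Group item indices; two items join a group if their distance is small enough."""
    parent = list(range(len(feature_sets)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair: fine for a sketch, quadratic in the number of items.
    for i in range(len(feature_sets)):
        for j in range(i + 1, len(feature_sets)):
            if jaccard_distance(feature_sets[i], feature_sets[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(feature_sets)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical feature sets for four product titles.
items = [
    {"brand:seagate", "capacity:500gb"},
    {"brand:seagate", "capacity:120gb", "form:laptop"},
    {"brand:seagate", "capacity:500gb", "rpm:7200"},
    {"brand:seagate", "capacity:500gb", "interface:usb2.0", "form:external"},
]
print(cluster(items))
```

The all-pairs loop will not scale to a huge product list; that is where a nearest-neighbour index comes in. Single-linkage grouping can also chain loosely related items together, so the threshold needs tuning on real data.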

Possibly useful Wikipedia articles:

Feature Extraction
Nearest Neighbour Search

Ralph M. Rickenbach · Answer

To actually classify a product, you could use something like an "enhanced neural network" with a blackboard. (This is just a metaphor to get you thinking in the right direction, not a strict use of the terms.)

Imagine a set of objects that are connected through listeners or events (just like neurons and synapses). Each object has a set of patterns and tests the input against these patterns.

An example:

One object tests for ("seagate"|"connor"|"maxtor"|"quantum"| ...)
Another object tests for [[:digit:]]+(" ")?("gb"|"mb")
Another object tests for [[:digit:]]+(" ")?"rpm"

All these objects connect to another object that, if certain combinations of them fire, categorizes the input as a hard drive. The individual objects themselves would enter certain characterizations into the blackboard (a common writing area for saying things about the input), such as manufacturer, capacity, or speed.
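A toy version of this idea might look like the following. It is only a sketch of the metaphor: the `PatternNeuron` class, the `classify` function, and the combination rule ("manufacturer plus capacity or speed means hard drive") are invented here, and a plain dict stands in for the blackboard:

```python
import re

class PatternNeuron:
    """Fires when its pattern matches, and writes what it found to the blackboard."""

    def __init__(self, name, pattern):
        self.name = name
        self.pattern = re.compile(pattern, re.IGNORECASE)

    def fire(self, text, blackboard):
        match = self.pattern.search(text)
        if match:
            blackboard[self.name] = match.group(0).lower()
        return match is not None

NEURONS = [
    PatternNeuron("manufacturer", r"\b(seagate|connor|maxtor|quantum)\b"),
    PatternNeuron("capacity", r"\b\d+\s?(gb|mb)\b"),
    PatternNeuron("speed", r"\b\d+\s?rpm\b"),
]

def classify(text):
    """Run every neuron, then apply a (hypothetical) combination rule."""
    blackboard = {}
    fired = set()
    for neuron in NEURONS:
        if neuron.fire(text, blackboard):
            fired.add(neuron.name)
    # Assumed rule: a known manufacturer plus a capacity or a speed.
    if "manufacturer" in fired and ("capacity" in fired or "speed" in fired):
        blackboard["category"] = "hard drive"
    return blackboard

print(classify("Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail"))
print(classify("Seagate Barracuda 7200.12"))
```

The second call only records the manufacturer and assigns no category, because neither the capacity nor the speed pattern matches. Recognising a bare model name like that would need an additional object that knows product lines.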

So the neurons do not fire based on a threshold, but on recognition of a pattern. Many of these neurons can work in parallel on the blackboard and even correct categorizations made by other neurons (maybe by introducing certainties?).
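One possible reading of the "certainties" idea, purely as an assumption: each entry on the blackboard carries a certainty, and a later post replaces an existing entry only if it is more certain. The `Blackboard` class and the numbers below are hypothetical:

```python
class Blackboard:
    """Shared writing area where each entry keeps the certainty it was posted with."""

    def __init__(self):
        self.entries = {}  # key -> (value, certainty)

    def post(self, key, value, certainty):
        """Store the value unless a more (or equally) certain one is already there."""
        current = self.entries.get(key)
        if current is None or certainty > current[1]:
            self.entries[key] = (value, certainty)
            return True
        return False

board = Blackboard()
board.post("category", "hard drive", 0.6)           # first guess
board.post("category", "enclosure", 0.4)            # less certain, ignored
board.post("category", "external hard drive", 0.9)  # more certain, replaces the guess
print(board.entries["category"])
```

If the neurons really ran in parallel, `post` would need a lock around the read-and-update, which is left out here to keep the sketch short.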

I used something like this in a prototype for a product that classifies products according to UNSPSC, and was able to get 97% correct classification on car parts.

Tags: algorithm, nlp

Martin
