I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.
To a human, hard drives 3 and 5 are the same. We could go a little further and treat products 1, 3, 4 and 5 as the same, and put products 2 and 6 in other categories.
We have a huge list of products that I would like to classify. Does anybody have an idea of what the best algorithm for this would be? Any suggestions?
I thought of a Bayesian classifier, but I am not sure it is the best choice. Any help would be appreciated!
Thanks.
You need at least two components:
First, you need something that does "feature" extraction, i.e. something that takes your items and extracts the relevant information. For example, "new and shiny" is not as relevant as "500Go hard drive" and "seagate". A (very) simple approach would be a heuristic that extracts manufacturers, technology names like "USB2.0" and patterns like "GB" or "RPM" from each item.
You then end up with a set of features for each item. Some machine learning people like to put this into a "feature vector", i.e. a vector with one entry per feature, set to 0 or 1 depending on whether the item has that feature. This is your data representation. On these vectors you can then do a distance comparison.
Note that you might end up with a vector of thousands of entries. Even then, you still have to cluster your results.
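A minimal sketch of the extraction and comparison steps described above. The regular expressions, the manufacturer list, the sample titles and the choice of Jaccard distance are all assumptions made for illustration; a real catalogue would need a much richer set of patterns.

```python
import re

# Hypothetical heuristics: a tiny manufacturer list and a few unit patterns.
MANUFACTURERS = {"seagate", "maxtor", "hitachi"}
PATTERNS = {
    "capacity": re.compile(r"(\d+)\s*(GB|TB|Go)\b", re.IGNORECASE),
    "rpm": re.compile(r"(\d+)\s*RPM\b", re.IGNORECASE),
    "interface": re.compile(r"\b(USB\s*\d\.\d|SATA|IDE)\b", re.IGNORECASE),
}


def extract_features(title):
    """Return the set of features found in a product title."""
    features = set()
    lowered = title.lower()
    for name in MANUFACTURERS:
        if name in lowered:
            features.add("brand=" + name)
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(title):
            # Normalise "500 GB" and "500GB" to the same token.
            value = re.sub(r"\s+", "", match.group(0)).lower()
            features.add(label + "=" + value)
    return features


def to_vector(features, vocabulary):
    """Binary feature vector: 1 if the item has the feature, else 0."""
    return [1 if f in features else 0 for f in vocabulary]


def jaccard_distance(a, b):
    """Distance between two binary vectors: 0.0 = identical, 1.0 = disjoint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0


titles = [
    "Seagate Barracuda 500GB 7200RPM SATA",
    "New and shiny Seagate 500 GB 7200 RPM SATA drive",
    "Seagate FreeAgent 1TB USB 2.0",
]
feature_sets = [extract_features(t) for t in titles]
vocabulary = sorted(set().union(*feature_sets))
vectors = [to_vector(fs, vocabulary) for fs in feature_sets]
```

With these made-up titles, the first two items produce the same vector, because "new and shiny" contributes no feature, while the third shares only the brand with them.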
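One possible way to do the clustering step, sketched here as single-linkage grouping with a distance threshold. The threshold of 0.5, the sample feature sets and the use of Jaccard distance on sets are assumptions, not something the answer prescribes. The pairwise loop is quadratic, so a truly huge list would need a more scalable method.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two feature sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster(items, threshold=0.5):
    """Group item indices so that items closer than `threshold` share a cluster.

    Single linkage: two items end up together if a chain of close pairs
    connects them, even when they are not directly close to each other.
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if jaccard_distance(items[i], items[j]) < threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical feature sets for three products.
items = [
    {"brand=seagate", "capacity=500gb", "rpm=7200rpm"},
    {"brand=seagate", "capacity=500gb", "rpm=7200rpm", "interface=sata"},
    {"brand=seagate", "capacity=1tb", "interface=usb2.0"},
]
clusters = cluster(items)
```

Here the first two products fall into one cluster and the third stays on its own. Lowering the threshold makes the grouping stricter; raising it merges more loosely related products.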
Possibly useful Wikipedia articles:
To actually classify a product, you could use something like an "enhanced neural network" with a blackboard. (This is just a metaphor to get you thinking in the right direction, not a strict use of the terms.)
Imagine a set of objects that are connected through listeners or events (just like neurons and synapses). Each object has a set of patterns and tests the input against these patterns.
An example:
All these objects connect to another object that categorizes the input as a hard drive if certain combinations of them fire. The individual objects themselves write certain characterizations onto the blackboard (a common writing area for saying things about the input), such as manufacturer, capacity, or speed.
So the neurons do not fire based on a threshold, but on recognition of a pattern. Many of these neurons can work on the blackboard in parallel and even correct categorizations made by other neurons (perhaps by introducing certainties?).
I used something like this in a prototype for a product that classified products according to UNSPSC, and it reached 97% correct classification on car parts.
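A rough sketch of the idea, not the prototype itself. The class names `PatternNeuron` and `CategoryNeuron`, the patterns and the rule "capacity plus speed means hard drive" are all made up for illustration, and the neurons run sequentially here rather than in parallel.

```python
import re


class PatternNeuron:
    """Fires when its pattern is found and writes the match to the blackboard."""

    def __init__(self, key, pattern):
        self.key = key
        self.pattern = re.compile(pattern, re.IGNORECASE)

    def fire(self, text, blackboard):
        match = self.pattern.search(text)
        if match:
            blackboard[self.key] = match.group(0)
        return match is not None


class CategoryNeuron:
    """Fires when every required characterization is on the blackboard."""

    def __init__(self, category, required):
        self.category = category
        self.required = set(required)

    def fire(self, blackboard):
        if self.required.issubset(blackboard):
            blackboard["category"] = self.category
            return True
        return False


def classify(text, detectors, categories):
    """Run all detectors, then let the first matching category neuron decide."""
    blackboard = {}
    for detector in detectors:
        detector.fire(text, blackboard)
    for category in categories:
        if category.fire(blackboard):
            break
    return blackboard


# Hypothetical detectors and a single category rule.
detectors = [
    PatternNeuron("manufacturer", r"\bseagate\b"),
    PatternNeuron("capacity", r"\d+\s*(GB|TB)\b"),
    PatternNeuron("speed", r"\d+\s*RPM\b"),
]
categories = [CategoryNeuron("hard drive", {"capacity", "speed"})]

result = classify("Seagate Barracuda 500GB 7200RPM", detectors, categories)
```

An input such as "Seagate carrying case" would only get a manufacturer entry and no category, since neither capacity nor speed fires. Certainties could be added by storing a score next to each blackboard entry.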