
Association mining with large number of small datasets

I have a large number (100-150) of small (approx 1 kbyte) datasets. We will call these the 'good' datasets. I also have a similar number of 'bad' datasets.

Now I'm looking for software (or perhaps algorithm(s)) to find rules for what constitutes a 'good' dataset versus a 'bad' dataset.

The important thing here is the software's ability to deal with the multiple datasets rather than just one large one.

Help much appreciated.
Paul.

Paul Lovell asked Mar 04 '12 13:03



2 Answers

It seems like a classification problem. If you have many datasets labelled as "good" or "bad" you can train a classifier to predict if a new dataset is good or bad.

Algorithms such as decision trees, k-nearest neighbors, SVMs, and neural networks are all potential tools that you could use.

However, you need to determine which attributes you will use to train the classifier.
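As a rough sketch of what choosing attributes could look like when each example is a whole small dataset: the snippet below reduces every dataset to one fixed-length row of summary attributes plus its label, so that many small datasets become a single training table. It assumes each dataset is a list of numbers, which the question does not state; the attribute choices and the names `summarize` and `build_training_table` are purely illustrative.

```python
from statistics import mean, pstdev

def summarize(values):
    """Turn one small numeric dataset into a fixed-length attribute row.

    The attributes here (count, mean, spread, range) are assumptions;
    pick whatever plausibly separates 'good' from 'bad' in your data.
    """
    return {
        "n": len(values),
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }

def build_training_table(good_datasets, bad_datasets):
    """Return one (attributes, label) pair per dataset."""
    rows = [(summarize(d), "good") for d in good_datasets]
    rows += [(summarize(d), "bad") for d in bad_datasets]
    return rows

# Made-up toy data, only to show the shape of the result.
good = [[1.0, 1.2, 0.9], [1.1, 1.0, 1.05]]
bad = [[5.0, 0.1, 9.3], [7.5, 0.2, 3.3]]
table = build_training_table(good, bad)
```

Any of the classifiers mentioned above can then be trained on `table`, since every dataset is now a single labelled example.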

Phil answered Nov 12 '22 06:11



One common way to do this is the k-nearest neighbors algorithm.

Extract fields (features) from each dataset. For example, if a dataset is text, a common way to extract fields is the bag-of-words model.

Store the labeled datasets as the "training set". When a new, unlabeled dataset arrives, find its k nearest neighbors in the training set [according to the extracted fields], and give the new dataset the label held by the majority of those k neighbors.
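The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each dataset is a short piece of text and using cosine distance between word counts; the distance measure, the value of k, and the toy training data are all assumptions, not something the question specifies.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Extract fields: count how often each lowercase word occurs."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty bag is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_classify(training_set, new_text, k=3):
    """Label new_text by majority vote among its k nearest neighbors."""
    query = bag_of_words(new_text)
    ranked = sorted(training_set,
                    key=lambda item: cosine_distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: (bag of words, label) per dataset.
training_set = [
    (bag_of_words("status ok checksum valid"), "good"),
    (bag_of_words("status ok all fields valid"), "good"),
    (bag_of_words("status ok checksum valid complete"), "good"),
    (bag_of_words("status error checksum missing"), "bad"),
    (bag_of_words("error fields missing truncated"), "bad"),
    (bag_of_words("status error truncated"), "bad"),
]

label = knn_classify(training_set, "status ok fields valid", k=3)
```

An odd k avoids tied votes when there are only two labels.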

Another common method is a decision tree. The pitfall with decision trees is making the decisions too specific, so that the tree fits the training data but generalizes poorly. An existing algorithm that can be used to build a [heuristically] good tree is ID3.
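For illustration, here is a compact sketch of the ID3 idea: at each node, split on the attribute with the highest information gain. It assumes categorical attributes stored in dictionaries, and it adds a `max_depth` cap as one simple way to keep the tree from becoming too specific; that cap is not part of the original ID3 algorithm, and the attribute names in the toy data are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs, max_depth=2):
    """Build a tree as nested dicts; leaves are plain labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs or max_depth == 0:
        return majority
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    if information_gain(rows, labels, best) <= 0:
        return majority  # no attribute helps, stop splitting
    tree = {"attr": best, "default": majority, "branches": {}}
    rest = [a for a in attrs if a != best]
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree["branches"][value] = id3([rows[i] for i in idx],
                                      [labels[i] for i in idx],
                                      rest, max_depth - 1)
    return tree

def predict(tree, row):
    """Walk the tree; unseen values fall back to the node's majority label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree

# Hypothetical attributes extracted from four datasets.
rows = [
    {"has_header": "yes", "size": "small"},
    {"has_header": "yes", "size": "large"},
    {"has_header": "no", "size": "small"},
    {"has_header": "no", "size": "large"},
]
labels = ["good", "good", "bad", "bad"]
tree = id3(rows, labels, ["has_header", "size"])
```

A side benefit for the original question is that the resulting tree can be read off as rules for what makes a dataset 'good' or 'bad'.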

amit answered Nov 12 '22 06:11
