I have a large number (100-150) of small (approx 1 kbyte) datasets. We will call these the 'good' datasets. I also have a similar number of 'bad' datasets.
Now I'm looking for software (or perhaps algorithm(s)) to find rules for what constitutes a 'good' dataset versus a 'bad' dataset.
The important thing here is the software's ability to deal with the multiple datasets rather than just one large one.
Help much appreciated.
Paul.
This looks like a classification problem: since you have many datasets labelled as "good" or "bad", you can train a classifier to predict whether a new dataset is good or bad.
Algorithms such as decision trees, k-nearest neighbors, SVMs, and neural networks are all potential tools you could use.
However, you first need to determine which attributes (features) you will extract from each dataset to train the classifier.
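As a rough sketch of the whole pipeline (assuming scikit-learn, and assuming your datasets are text files sitting in good/ and bad/ directories; the paths and the bag-of-words feature choice are just placeholders):

```python
# Hypothetical sketch: turn each small dataset (assumed here to be a text
# file) into a feature vector with bag-of-words, then train a classifier.
# Directory names and the feature choice are placeholders.
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

good_files = sorted(Path("good").glob("*.txt"))   # ~100-150 'good' datasets
bad_files = sorted(Path("bad").glob("*.txt"))     # ~100-150 'bad' datasets

texts = [p.read_text() for p in good_files + bad_files]
labels = [1] * len(good_files) + [0] * len(bad_files)  # 1 = good, 0 = bad

vectorizer = CountVectorizer()          # bag-of-words feature extraction
X = vectorizer.fit_transform(texts)     # one feature vector per dataset

clf = LinearSVC()
# Cross-validate to estimate how well the learned rules generalise.
scores = cross_val_score(clf, X, labels, cv=5)
print("cross-validated accuracy:", scores.mean())
```

The key point is that each of your 100-150 small datasets becomes one row (one feature vector) in the training matrix, so the "many datasets" problem reduces to an ordinary classification problem.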
One common way to do this is the k-nearest neighbors (k-NN) method.

Extract features from each dataset; for example, if your datasets are text, a common way to extract features is the bag-of-words model.

Store the labelled examples as your training set. When a new, unlabeled dataset arrives, find its k nearest neighbors in the training set (according to the extracted features) and label the new dataset with the majority label among those k neighbors. A minimal sketch of the voting step is shown below.
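Here is a minimal k-NN sketch, assuming each dataset has already been reduced to a numeric feature vector (the function and variable names are just illustrative):

```python
# Minimal k-NN sketch: classify a query vector by majority vote among the
# k closest training examples (Euclidean distance).
import math
from collections import Counter


def knn_label(query, training_set, k=5):
    """training_set is a list of (feature_vector, label) pairs."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Sort training examples by distance to the query and keep the k closest.
    neighbors = sorted(training_set, key=lambda item: distance(query, item[0]))[:k]
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Example usage with toy 2-D feature vectors:
train = [([1.0, 0.0], "good"), ([0.9, 0.1], "good"), ([0.0, 1.0], "bad")]
print(knn_label([0.8, 0.2], train, k=3))   # -> "good"
```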
Another common method is a decision tree. The main pitfall with decision trees is overfitting: don't let the tree's decisions become too specific to your training examples. An existing algorithm that can be used to build a good tree heuristically is ID3.
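For instance, here is a sketch using scikit-learn's DecisionTreeClassifier (which implements a CART-style algorithm rather than ID3 itself, but criterion="entropy" uses the same information-gain idea; the feature names and data below are made up):

```python
# Sketch of fitting a decision tree on toy features. Limiting max_depth is
# one simple way to keep the learned rules from getting too specific.
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["error_count", "record_count", "has_header"]  # hypothetical features
X = [
    [0, 120, 1],   # good
    [1, 118, 1],   # good
    [9,  40, 0],   # bad
    [7,  35, 0],   # bad
]
labels = ["good", "good", "bad", "bad"]

clf = DecisionTreeClassifier(
    criterion="entropy",   # information gain, as in ID3
    max_depth=3,           # shallow tree => rules stay general
)
clf.fit(X, labels)

# Print the learned tree as human-readable 'good vs. bad' rules.
print(export_text(clf, feature_names=feature_names))
```

A nice property of this approach for your use case is that the tree can be printed as explicit rules, which directly answers the "find rules for what constitutes a good dataset" part of the question.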