I have a binary class dataset (0 / 1) with a large skew towards the "0" class (about 30000 vs 1500). There are 7 features for each instance, no missing values.
When I use the J48 or any other tree classifier, I get almost all of the "1" instances misclassified as "0".
Setting the classifier to "unpruned", setting the minimum number of instances per leaf to 1, setting the confidence factor to 1, adding a dummy attribute with the instance ID number: none of this helped.
I just can't create a model that overfits my data!
I've also tried almost all of the other classifiers Weka provides, but got similar results.
Using IB1 gets 100% accuracy (trainset on trainset) so it's not a problem of multiple instances with the same feature values and different classes.
How can I create a completely unpruned tree? Or otherwise force Weka to overfit my data?
Thanks.
Update: Okay, this is absurd. I've used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):
J48 unpruned tree
------------------
F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)
Needless to say, IB1 still gives 100% precision.
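For reference, the leaf annotations in that J48 printout can be turned into a confusion matrix by hand. In Weka's tree output, (N/E) at a leaf means that N training instances reach the leaf and E of them are misclassified. A minimal Python sketch of the arithmetic, using only the counts printed above (the variable names are illustrative, not Weka's):

```python
# Leaf counts copied from the J48 output: (instances reaching leaf, misclassified).
leaf_pred_1 = (201.0, 54.0)     # F <= 0.90747: predicted "1"
leaf_pred_0 = (4153.0, 1062.0)  # F >  0.90747: predicted "0"

tp = leaf_pred_1[0] - leaf_pred_1[1]  # "1" instances correctly labelled "1"
fp = leaf_pred_1[1]                   # "0" instances wrongly labelled "1"
tn = leaf_pred_0[0] - leaf_pred_0[1]  # "0" instances correctly labelled "0"
fn = leaf_pred_0[1]                   # "1" instances wrongly labelled "0"

total = tp + fp + tn + fn
accuracy = (tp + tn) / total
recall_1 = tp / (tp + fn)                 # share of "1" instances recovered
majority_baseline = (tn + fp) / total     # accuracy of always predicting "0"

print(f"positives={tp + fn:.0f} negatives={tn + fp:.0f}")
print(f"accuracy={accuracy:.3f} recall(1)={recall_1:.3f} baseline={majority_baseline:.3f}")
```

The counts imply 1209 positives and 3145 negatives, which matches the "about 3100 negative and 1200 positive" figure. Training accuracy comes out at roughly 74%, barely above the roughly 72% obtained by always predicting "0", and only about 12% of the "1" instances are recovered.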
Update 2: Don't know how I missed it - unpruned SimpleCart works and gives 100% accuracy train on train; pruned SimpleCart is not as biased as J48 and has a decent false positive and negative ratio.
Overfitting means that your model has "memorized" the training set instead of learning patterns that generalize. Results can also look misleadingly good when the test set closely resembles the training set. Resampling basically selects different sets for training and testing.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model.
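A small sketch of what "learning the noise" looks like in practice, assuming synthetic one-dimensional data with 20% of the labels flipped (this is not the asker's dataset). A 1-nearest-neighbour classifier, the same idea as Weka's IB1, reproduces every training label, noise included, so it scores perfectly on the training set while doing worse on fresh data drawn the same way. All function names here are made up for the illustration.

```python
import random

random.seed(0)

def make_data(n, noise=0.2):
    """Points in [0, 1) labelled by x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < noise:
            label = 1 - label  # label noise the model should NOT learn
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(train, data):
    """Fraction of (x, label) pairs in data that 1-NN on train gets right."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)

train = make_data(300)
test = make_data(300)

train_acc = accuracy(train, train)  # every point is its own nearest neighbour
test_acc = accuracy(train, test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The training accuracy is 100% by construction, because each training point is its own nearest neighbour. That is why a perfect train-on-train score from IB1 says little about how well the model generalizes.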
The common pattern for overfitting can be seen on learning curve plots, where model performance on the training dataset continues to improve (e.g. loss or error continues to fall or accuracy continues to rise) and performance on the test or validation set improves to a point and then begins to get worse.
Overfitting can be identified by checking validation metrics such as accuracy and loss. Validation accuracy usually improves up to a point and then stagnates or starts declining (while validation loss stops falling and starts rising) once the model is affected by overfitting.
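The learning-curve pattern described above can be checked mechanically. The following is a rough sketch, assuming you have recorded a validation loss per epoch; the loss values and the overfitting_onset helper are invented for illustration and do not come from any particular library.

```python
# Hypothetical per-epoch losses, invented for illustration only.
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23]
val_loss = [0.95, 0.78, 0.66, 0.60, 0.58, 0.61, 0.65, 0.70]

def overfitting_onset(val_losses, patience=2):
    """Return the index of the best validation loss once it has failed to
    improve for `patience` consecutive epochs, or None if that never happens."""
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < val_losses[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None

onset = overfitting_onset(val_loss)
print(onset)
```

With these numbers the training loss keeps falling through the last epoch, while the validation loss bottoms out at index 4 and then rises, so the helper reports epoch 4 as the point after which further training only fits the training set.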
Weka contains two meta-classifiers of interest: weka.classifiers.meta.CostSensitiveClassifier and weka.classifiers.meta.MetaCost.
They allow you to make any algorithm cost-sensitive (not restricted to SVM) and to specify a cost matrix (the penalty for each kind of error); you would give a higher penalty for misclassifying "1" instances as "0" than you would give for erroneously classifying "0" as "1".
The result is that the algorithm would then try to minimize the expected misclassification cost (rather than predict the most likely class).
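To make the difference between "most likely class" and "minimum expected cost" concrete, here is a minimal Python sketch of the decision rule. It does not use Weka; the cost values, the cost[actual][predicted] layout, and the function names are assumptions chosen for the illustration.

```python
# cost[actual][predicted]; the values are illustrative assumptions.
COST = [
    [0.0, 1.0],   # actual 0: predicting 1 (false positive) costs 1
    [10.0, 0.0],  # actual 1: predicting 0 (false negative) costs 10
]

def most_likely_class(probs):
    """Ordinary decision rule: pick the class with the highest probability."""
    return max(range(len(probs)), key=lambda c: probs[c])

def min_expected_cost_class(probs, cost=COST):
    """Pick the prediction whose expected cost, summed over the possible
    actual classes weighted by their probability, is smallest."""
    def expected_cost(predicted):
        return sum(probs[actual] * cost[actual][predicted]
                   for actual in range(len(probs)))
    return min(range(len(probs)), key=expected_cost)

probs = [0.7, 0.3]  # the model believes class 0 is more likely
print(most_likely_class(probs))        # 0
print(min_expected_cost_class(probs))  # 1
```

With these costs, predicting "0" has an expected cost of 0.3 * 10 = 3.0 and predicting "1" has 0.7 * 1 = 0.7, so the cost-sensitive rule outputs "1" even though "0" is the more probable class. In the two-class case this amounts to lowering the threshold on P(1) from 0.5 to 1 / (1 + 10), about 0.09, which is what pushes a skewed classifier towards the rare class.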