I am using the scikit-learn Multinomial Naive Bayes classifier for binary text classification (the classifier tells me whether a document belongs to the category X or not). I train my model on a balanced dataset and test it on a balanced test set, and the results are very promising.
This classifier needs to run in real time and constantly analyze documents thrown at it randomly.
However, when I run my classifier in production, the number of false positives is very high and therefore I end up with very low precision. The reason is simple: there are many more negative samples that the classifier encounters in the real-time scenario (around 90% of the time), and this does not correspond to the ideal balanced dataset I used for training and testing.
Is there a way I can simulate this real-time case during training, or are there any tricks that I can use (including pre-processing of the documents to see if they are suitable for the classifier)?
I was planning to train my classifier on an imbalanced dataset with the same proportions as in the real-time case, but I am afraid that might bias Naive Bayes towards the negative class and lose the recall I have on the positive class.
Any advice is appreciated.
Yes, naive Bayes is affected by imbalanced data. Even though the likelihood probabilities stay roughly the same, the posterior probability is badly skewed by the prior probabilities.
A widely adopted and perhaps the most straightforward method for dealing with highly imbalanced datasets is called resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling).
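As a rough sketch, assuming your positive and negative documents are already split into separate arrays, both ideas can be done with scikit-learn's resample utility (the array names and sizes below are placeholders, not from the original question):

```python
import numpy as np
from sklearn.utils import resample

# Placeholder feature matrices: 100 positive (minority) and 900 negative (majority) samples.
X_pos = np.random.rand(100, 20)
X_neg = np.random.rand(900, 20)

# Under-sampling: shrink the majority class down to the size of the minority class.
X_neg_down = resample(X_neg, replace=False, n_samples=len(X_pos), random_state=42)

# Over-sampling: repeat minority samples (with replacement) up to the size of the majority class.
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg), random_state=42)

# Stack X_pos with X_neg_down, or X_pos_up with X_neg, to get a balanced training set.
```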
Another way to describe the imbalance of classes in a dataset is to summarize the class distribution as percentages of the training dataset. For example, an imbalanced multiclass classification problem may have 80 percent examples in the first class, 18 percent in the second class, and 2 percent in a third class.
You have encountered one of the problems with classification with a highly imbalanced class distribution. I have to disagree with those that state the problem is with the Naive Bayes method, and I'll provide an explanation which should hopefully illustrate what the problem is.
Imagine your false positive rate is 0.01, and your true positive rate is 0.9. This means your false negative rate is 0.1 and your true negative rate is 0.99.
Imagine an idealised test scenario where you have 100 test cases from each class. You'll get (in expectation) 1 false positive and 90 true positives. Great! Precision is 90 / (90 + 1) ≈ 0.99 on your positive class!
Now imagine there are 10,000 times more negative examples than positive: the same 100 positive examples at test time, but now 1,000,000 negative examples. You still get the same 90 true positives, but (0.01 * 1,000,000) = 10,000 false positives. Disaster! Your precision is now almost zero (90 / (90 + 10,000) ≈ 0.009).
The point here is that the performance of the classifier hasn't changed; false positive and true positive rates remained constant, but the balance changed and your precision figures dived as a result.
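The arithmetic is easy to check directly, using the same rates as in the example above:

```python
tpr, fpr = 0.9, 0.01

# Balanced test set: 100 positives, 100 negatives.
tp, fp = tpr * 100, fpr * 100
print(tp / (tp + fp))   # ~0.989

# Heavily imbalanced test set: 100 positives, 1,000,000 negatives.
tp, fp = tpr * 100, fpr * 1_000_000
print(tp / (tp + fp))   # ~0.009
```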
What to do about it is harder. If your scores are separable but the threshold is wrong, look at the ROC curve built from the posterior probabilities and see whether there is a threshold that gives you the kind of performance you want. If your scores are not separable, try a number of different classifiers and see if you can find one where they are: logistic regression is pretty much a drop-in replacement for Naive Bayes, and you might also want to experiment with non-linear classifiers such as a neural net or a non-linear SVM, since you can often end up with non-linear boundaries delineating the space of a very small class.
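A hedged sketch of the threshold idea: it assumes you already have a fitted MultinomialNB (clf), a validation set (X_val, y_val) whose class proportions resemble production, new documents X_new, and a 0.001 false positive rate target chosen purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

probs = clf.predict_proba(X_val)[:, 1]          # posterior P(positive | doc)
fpr, tpr, thresholds = roc_curve(y_val, probs)

# Example policy: the largest TPR achievable while keeping FPR at or below 0.001.
ok = fpr <= 0.001
best = np.argmax(tpr[ok])
threshold = thresholds[ok][best]

# At prediction time, classify as positive only above this threshold.
y_pred = (clf.predict_proba(X_new)[:, 1] >= threshold).astype(int)
```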
To simulate this effect from a balanced test set, you can simply multiply instance counts by an appropriate multiplier in the contingency table (for instance, if your negative class is 10x the size of the positive, make every negative instance in testing add 10 counts to the contingency table instead of 1).
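One way to apply that multiplier without duplicating any data is to pass per-instance weights to the metric functions. The snippet below assumes the negative class is labelled 0 and uses the 10x factor from the example; y_true and y_pred are the labels and predictions from your balanced test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

weights = np.where(y_true == 0, 10, 1)   # count each negative instance 10 times

print(confusion_matrix(y_true, y_pred, sample_weight=weights))
print(precision_score(y_true, y_pred, sample_weight=weights))
```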
I hope that's of some help at least understanding the problem you're facing.
As @Ben Allison says, the issue you're facing is basically that your classifier's accuracy isn't good enough - or, more specifically: its false positive rate is too high for the class distribution it encounters.
The "textbook" solution would indeed be to train the classifier using a balanced training set, getting a "good" classifier, then find a point on the classifier's performance curve (e.g. ROC curve) which best balances between your accuracy requirements; I assume that in your case, it would be biased towards lower false positive rate, and higher false negative rate.
However, it may well be that the classifier is simply not good enough for your requirements - at the point where the false positives reach an acceptable level, you might be missing too many good cases.
One solution for that would be, of course, to use more data, or to try another type of classifier, e.g. logistic regression or a linear SVM, which generally perform well in text classification.
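As a sketch of how little code that swap takes (the TF-IDF features and the train_docs/train_labels names are assumptions about your pipeline, not part of the original answer):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Two candidate drop-in replacements, each wrapped in the same vectorizer.
logreg = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
svm = make_pipeline(TfidfVectorizer(), LinearSVC())

logreg.fit(train_docs, train_labels)
svm.fit(train_docs, train_labels)
```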
Having said that, it may be the case that you prefer to use Naive Bayes for some reason (e.g. constraints on training time, frequent addition of new classes, or pre-existing models). In that case, I can give some practical advice on what can be done.
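One practical knob along those lines, offered as an assumption-laden sketch rather than a definitive recipe: MultinomialNB lets you fix the class priors yourself via its class_prior parameter, so you can keep a balanced training set while telling the model that production is roughly 90% negative. The 90/10 split, the label encoding, and the variable names below are illustrative only.

```python
from sklearn.naive_bayes import MultinomialNB

# Priors follow the order of clf.classes_ (sorted labels); here 0 = negative, 1 = positive.
clf = MultinomialNB(class_prior=[0.9, 0.1])
clf.fit(X_train, y_train)   # X_train/y_train are placeholders for your balanced training data
```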