Spark, MLlib: Adjusting classifier discrimination threshold

I am trying to use Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to build a model that discriminates between two classes whose cardinalities differ by a wide margin: one set has 150,000,000 negative instances, the other just 50,000 positive instances.

After training both the LR and RF classifiers with default parameters, I get very similar results from both. For example, for the following test set:

Test instances: 26842
Test positives = 433.0 
Test negatives = 26409.0

The classifier detects:

truePositives = 0.0  
trueNegatives = 26409.0  
falsePositives = 433.0  
falseNegatives = 0.0 
Precision = 0.9838685641904478
Recall = 0.9838685641904478

It looks like the classifier cannot detect any positive instances at all. Also, no matter how the data is split into training and test sets, the classifier reports a number of false positives exactly equal to the number of positives actually present in the test set.

The LR classifier's default threshold is 0.5, and setting it to 0.8 does not make any difference:

val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)

Questions:

1) How can the classifier threshold be manipulated to make the classifier more sensitive to the class with a tiny fraction of positive instances versus the class with a huge number of negative instances?

2) Are there other MLlib classifiers that could solve this problem?

3) What does the intercept parameter do in the Logistic Regression algorithm?

val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
asked Aug 03 '15 by zork

1 Answer

Well, I think what you have here is a very unbalanced data set problem: 150,000,000 instances of Class1 versus 50,000 of Class2, which makes Class2 3,000 times smaller.

So if you train a classifier that predicts Class1 for everything, you get 150,000,000 / 150,050,000 ≈ 0.999666 accuracy. Judged by accuracy alone, the best classifier is therefore the one that labels everything Class1, and that is exactly what your model is learning.

There are different ways to address such cases. In general, you can down-sample the larger class or up-sample the smaller one. With random forests there are further options, for example sampling in a balanced (stratified) way, or adding class weights (see the sketch after the following reference):

http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
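
For illustration, here is a minimal sketch of both sampling strategies in the RDD-based MLlib API, assuming training is the RDD[LabeledPoint] from the question with the rare class labeled 1.0 (the variable names are hypothetical):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Split the training data by class; the rare positives carry label 1.0.
val positives: RDD[LabeledPoint] = training.filter(_.label == 1.0)
val negatives: RDD[LabeledPoint] = training.filter(_.label == 0.0)

// Down-sample the majority class so both classes end up roughly equal in size.
val fraction = positives.count().toDouble / negatives.count()
val downSampled = positives.union(negatives.sample(false, fraction, 42L))

// Or up-sample the minority class with replacement; a fraction > 1.0
// repeats each instance that many times on average.
val upSampled = negatives.union(positives.sample(true, 1.0 / fraction, 42L))

val model = new LogisticRegressionWithLBFGS().run(downSampled)

Training on the rebalanced set removes the incentive to predict the majority class everywhere; the trade-off is that down-sampling discards data while up-sampling duplicates it.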

Other methods, such as SMOTE, also exist (these are sampling approaches as well); for more details you can read here (a tiny sketch of the SMOTE step follows the link):

https://www3.nd.edu/~dial/papers/SPRINGER05.pdf
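
For intuition, here is a minimal, non-distributed sketch of the core SMOTE step: a synthetic minority sample is created by interpolating between a minority instance and a nearby minority-class neighbor. The helper names are hypothetical and this is not an MLlib API:

import scala.util.Random

// Nearest minority-class neighbor by squared Euclidean distance.
def nearestNeighbor(x: Array[Double], others: Seq[Array[Double]]): Array[Double] =
  others.minBy(o => x.zip(o).map { case (a, b) => (a - b) * (a - b) }.sum)

// Synthesize a new point at a random position on the segment between
// x and its nearest neighbor within the minority class.
def smoteSample(x: Array[Double], minority: Seq[Array[Double]], rng: Random): Array[Double] = {
  val neighbor = nearestNeighbor(x, minority.filterNot(_ eq x))
  val gap = rng.nextDouble()
  x.zip(neighbor).map { case (xi, ni) => xi + gap * (ni - xi) }
}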

The threshold you can change for your logistic regression applies to the predicted probability. You can try playing with "probabilityCol" in the parameters of the logistic regression example here (a sketch of the mllib equivalent follows the link):

http://spark.apache.org/docs/latest/ml-guide.html
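
Note that "probabilityCol" belongs to the newer spark.ml pipeline API. In the RDD-based mllib API used in the question, the equivalent move is clearThreshold(), which makes predict() return the raw class-1 probability so you can apply your own cut-off. A minimal sketch, assuming test is an RDD[LabeledPoint] like training:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS().run(training)

// After clearThreshold(), predict() returns the probability of class 1.0
// instead of a hard 0/1 label.
model.clearThreshold()

// With a rare positive class you usually LOWER the cut-off to become more
// sensitive to positives, rather than raising it to 0.8 as in the question.
val customThreshold = 0.1 // hypothetical value; tune it on a validation set
val predictionsAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (if (score >= customThreshold) 1.0 else 0.0, point.label)
}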

A current limitation of MLlib, however, is that not all classifiers return a probability; I asked the developers about this, and it is on their roadmap.

answered Oct 03 '22 by Dr VComas