How to perform logistic regression using vowpal wabbit on very imbalanced dataset

Tags: classification, logistic-regression, vowpalwabbit

I am trying to use vowpal wabbit for logistic regression, but I am not sure whether this is the right syntax for it.

For training, I do

    ./vw -d ~/Desktop/new_data.txt --passes 20 --binary --cache_file cache.txt -f lr.vw --loss_function logistic --l1 0.05

For testing, I do

    ./vw -d ~/libsvm-3.18_test/matlab/new_data_test.txt --binary -t -i lr.vw -p predictions.txt -r raw_score.txt

Here is a snippet from my train data

-1:1.00038 | 110:0.30103 262:0.90309 689:1.20412 1103:0.477121 1286:1.5563 2663:0.30103 2667:0.30103 2715:4.63112 3012:0.30103 3113:8.38411 3119:4.62325 3382:1.07918 3666:1.20412 3728:5.14959 4029:0.30103 4596:0.30103

1:2601.25 | 32:2.03342 135:3.77379 146:3.19535 284:2.5563 408:0.30103 542:3.80618 669:1.07918 689:2.25527 880:0.30103 915:1.98227 1169:5.35371 1270:0.90309 1425:0.30103 1621:0.30103 1682:0.30103 1736:3.98227 1770:0.60206 1861:4.34341 1900:3.43136 1905:7.54141 1991:5.33791 2437:0.954243 2532:2.68664 3370:2.90309 3497:0.30103 3546:0.30103 3733:0.30103 3963:0.90309 4152:3.23754 4205:1.68124 4228:0.90309 4257:1.07918 4456:0.954243 4483:0.30103 4766:0.30103

Here is a snippet from my test data

-1 | 110:0.90309 146:1.64345 543:0.30103 689:0.30103 1103:0.477121 1203:0.30103 1286:2.82737 1892:0.30103 2271:0.30103 2715:4.30449 3012:0.30103 3113:7.99039 3119:4.08814 3382:1.68124 3666:0.60206 3728:5.154 3960:0.778151 4309:0.30103 4596:0.30103 4648:0.477121

However, when I look at the results, the predictions are all -1 and the raw scores are all 0. I have around 200,000 examples, of which 100 are +1 and the rest are -1. To handle this imbalance, I gave the positive examples a weight of 200,000/100 and the negative examples a weight of 200,000/(200,000-100). Is this happening because my data is so highly unbalanced, even though I adjusted the weights?

I was expecting P(y|x) in the raw score file, but I get all zeros. I just need the probability outputs. Any suggestions as to what is going on?

asked Jul 08 '14 14:07

user34790

1 Answer

A similar question was posted on the vw mailing list. I'll try to summarize the main points in all responses here for the benefit of future users.

Unbalanced training sets best practices:

Your training set is highly unbalanced (200,000 to 100). This means that only 0.0005 (0.05%) of examples have a label of 1. By always predicting -1, the classifier achieves a remarkable accuracy of 99.95%. In other words, if the cost of a false-positive is equal to the cost of a false-negative, this is actually an excellent classifier. If you are looking for an equal-weighted result, you need to do two things:

1. Reweigh your examples so the smaller group would have equal weight to the larger one.
2. Reorder/shuffle the examples so positives and negatives are intermixed.

The 2nd point is especially important in online learning, where the learning rate decays over time. It follows that the ideal order for online learning, assuming you are free to reorder the examples (e.g. no time dependence between them), is a completely uniform interleaving (1, -1, 1, -1, ...).
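The two steps above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the input lines look like `label | features` with labels `1` and `-1` and no weights yet, the whole file fits in memory, and the helper name `balance_and_shuffle` is made up for this example.

```python
import random


def balance_and_shuffle(lines, seed=0):
    """Give the rare class (+1) enough weight to match the common class, then shuffle.

    Assumes each line is 'label | features' with label '1' or '-1' and no weight.
    """
    def label_of(line):
        return line.split("|", 1)[0].split()[0]

    n_pos = sum(1 for line in lines if label_of(line) == "1")
    n_neg = len(lines) - n_pos
    pos_weight = n_neg / n_pos  # e.g. 199900 / 100 = 1999 for the data in the question

    out = []
    for line in lines:
        header, features = line.split("|", 1)
        label = header.split()[0]
        weight = pos_weight if label == "1" else 1.0
        # Written as "label weight | features"
        out.append(f"{label} {weight:g} |{features}")

    # Uniform shuffle so positives and negatives are intermixed
    random.Random(seed).shuffle(out)
    return out
```

A fixed seed keeps the shuffle reproducible; drop it if you want a different order on each run.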

Also note that the syntax for the example-weights (assuming a 2000:1 prevalence ratio) needs to be something like the following:

    1   2000  optional-tag| features ...
    -1  1     optional-tag| features ...
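The training snippet in the question writes the weight as `label:weight` (for example `-1:1.00038 |`). A small, hypothetical helper like the one below could rewrite such lines into the space-separated layout shown above. It assumes the part before the `|` is a single `label:weight` token with no tag; anything else is passed through unchanged.

```python
def fix_weight_syntax(line):
    """Rewrite 'label:weight | features' as 'label weight | features'.

    Lines that do not match that exact shape are returned unchanged.
    """
    line = line.rstrip("\n")
    header, sep, features = line.partition("|")
    tokens = header.split()
    if sep and len(tokens) == 1 and ":" in tokens[0]:
        label, weight = tokens[0].split(":", 1)
        return f"{label} {weight} |{features}"
    return line
```

For instance, `fix_weight_syntax("-1:1.00038 | 110:0.30103")` gives back the same example with the label and weight separated by a space.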

As mentioned above, splitting the single example of weight 2000 into 2000 repeats of weight 1 each, interleaved with the 2000 common examples (those with the -1 label):

    1  | ...
    -1 | ...
    1  | ...  # repeated, very rare, example
    -1 | ...
    1  | ...  # repeated, very rare, example

should lead to even better results in terms of smoother convergence and lower training loss.

Caveat: as a general rule, repeating any example too many times, as with a 1:2000 ratio, is very likely to over-fit the repeated class. You may want to counter that with slower learning (using --learning_rate ...) and/or randomized resampling (using --bootstrap ...).
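A minimal sketch of the repeat-and-interleave idea, assuming the rare and common examples have already been separated into two non-empty lists of VW-format lines (the function name is hypothetical):

```python
from itertools import cycle


def interleave_with_repeats(rare, common):
    """Alternate each common example with a cyclically repeated rare example.

    'rare' must be non-empty; its items are reused as often as needed.
    """
    rare_cycle = cycle(rare)
    out = []
    for common_line in common:
        out.append(next(rare_cycle))  # rare example, repeated as needed
        out.append(common_line)       # one common example
    return out
```

The over-fitting caveat above still applies: every rare example appears many times in the output.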

Consider downsampling the prevalent class

To avoid over-fitting: rather than overweighting the rare class by 2000x, consider going the opposite way and "underweighting" the more common class by throwing away most of its examples. While this may sound surprising (how can throwing away perfectly good data be beneficial?), it avoids over-fitting the repeated class as described above, and may actually lead to better generalization. Depending on the case and on the costs of a false classification, the optimal down-sampling factor may vary (it is not necessarily 1/2000 in this case, but may be anywhere between 1 and 1/2000).

Another approach, which requires some programming, is to use active learning: train on a very small part of the data, then continue to predict the class without learning (-t or zero weight); if the class is the prevalent class and the online classifier is very certain of the result (the predicted value is extreme, or very close to -1 when using --link glf1), throw the redundant example away. In other words: focus your training on the boundary cases only.
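Plain random down-sampling of the common class could look roughly like this. It is a sketch, not a recommendation of a particular factor: `keep_fraction` is something you would have to tune, and the code assumes the label is the first token before the `|`.

```python
import random


def downsample_common(lines, keep_fraction, common_label="-1", seed=0):
    """Keep every rare example and a random fraction of the common ones."""
    rng = random.Random(seed)
    kept = []
    for line in lines:
        label = line.split("|", 1)[0].split()[0]
        # Rare examples are always kept; common ones with probability keep_fraction
        if label != common_label or rng.random() < keep_fraction:
            kept.append(line)
    return kept
```

The kept lines stay in their original order, so shuffling as described earlier is still needed afterwards.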

Use of --binary (depends on your need)

--binary outputs the sign of the prediction (and calculates progressive loss accordingly). If you want probabilities, do not use --binary and pipe vw prediction output into utl/logistic (in the source tree). utl/logistic will map the raw prediction into signed probabilities in the range [-1, +1].

One effect of --binary is a misleading (optimistic) loss. Clamping predictions to {-1, +1} can dramatically increase the apparent accuracy, as every correct prediction has a loss of 0.0. This can mislead: just adding --binary often makes the model look much more accurate (sometimes perfectly accurate) than it does without --binary.

Update (Sep 2014): a new option was recently added to vw: --link logistic, which implements the [0, 1] mapping inside vw while predicting. Similarly, --link glf1 implements the more commonly needed [-1, 1] mapping. Mnemonic: glf1 stands for "generalized logistic function with a [-1, 1] range".
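If you prefer to post-process raw predictions yourself, the two mappings can be written as below. This is my reading of what the links compute (the standard logistic function, and the same function rescaled to [-1, 1]); the function names are made up, so check the results against your vw version.

```python
import math


def logistic(raw):
    """Map a raw score to (0, 1) with the standard logistic function."""
    if raw >= 0:
        return 1.0 / (1.0 + math.exp(-raw))
    # For negative inputs use the equivalent form that avoids overflow
    e = math.exp(raw)
    return e / (1.0 + e)


def glf1(raw):
    """Map a raw score to (-1, 1): the logistic function rescaled and shifted."""
    return 2.0 * logistic(raw) - 1.0
```

A raw score of 0 sits exactly on the decision boundary: it maps to 0.5 under `logistic` and to 0 under `glf1`.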

Go easy on --l1 and --l2

It is a common mistake to use high --l1 and/or --l2 values. The values are used directly per example, rather than, say, relative to 1.0. More precisely, in vw, l1 and l2 apply directly to the sum of gradients (or the "norm") in each example. Try much lower values, like --l1 1e-8. utl/vw-hypersearch can help you find optimal values of various hyper-parameters.

Be careful with multiple passes

It is a common mistake to use --passes 20 in order to minimize training error. Remember that the goal is to minimize generalization error rather than training error. Even with the cool addition of holdout (thanks to Zhen Qin) where vw automatically early-terminates when error stops going down on automatically held-out data (by default every 10th example is being held-out), multiple passes will eventually start to over-fit the held-out data (the "no free lunch" principle).
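Pulling the suggestions together, the commands from the question might be adjusted along these lines. Treat this as a sketch: the file names and the number of passes are placeholders, `vw` is assumed to be on the PATH, and the helpers only build argument lists (for `subprocess.run`, say) rather than running anything.

```python
import shlex


def train_command(data_path, model_path, cache_path="cache.txt", passes=2, l1=1e-8):
    """Training command: logistic loss, no --binary, small --l1, few passes."""
    return ["vw", "-d", data_path,
            "--loss_function", "logistic",
            "--passes", str(passes), "--cache_file", cache_path,
            "--l1", str(l1),
            "-f", model_path]


def predict_command(data_path, model_path, predictions_path):
    """Test-only command: --link logistic maps predictions into [0, 1]."""
    return ["vw", "-d", data_path, "-t", "-i", model_path,
            "--link", "logistic",
            "-p", predictions_path]


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell
    print(shlex.join(train_command("train.txt", "lr.vw")))
    print(shlex.join(predict_command("test.txt", "lr.vw", "predictions.txt")))
```

The training data passed in would be the reweighted and shuffled file, not the original one.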

answered Oct 22 '22 09:10

arielf
