How do you usually get precision, recall and f-measure from a model created in Vowpal Wabbit on a classification problem?
Are there any available scripts or programs that are commonly used for this with vw's output?
To make a minimal example, I use the following data in playtennis.txt:
2 | sunny 85 85 false
2 | sunny 80 90 true
1 | overcast 83 78 false
1 | rain 70 96 false
1 | rain 68 80 false
2 | rain 65 70 true
1 | overcast 64 65 true
2 | sunny 72 95 false
1 | sunny 69 70 false
1 | rain 75 80 false
1 | sunny 75 70 true
1 | overcast 72 90 true
1 | overcast 81 75 false
2 | rain 71 80 true
I create the model with:
vw playtennis.txt --oaa 2 -f playtennis.model --loss_function logistic
Then, I get predictions and raw predictions of the trained model on the training data itself with:
vw -t -i playtennis.model playtennis.txt -p playtennis.predict -r playtennis.rawp
Going from here, what scripts or programs do you usually use to get precision, recall and f-measure, given the training data in playtennis.txt and the predictions on the training data in playtennis.predict?
Also, if this were a multi-label classification problem (each instance can have more than one target label, which vw can also handle), would your proposed scripts or programs be capable of processing these?
Given that you have a pair of 'predicted vs actual' values for each example, you can use Rich Caruana's KDD perf utility to compute these (and many other) metrics.
In the case of multi-class, you should simply consider every correctly classified case a success and every class-mismatch a failure to predict correctly.
Here's a more detailed recipe for the binary case:
# get the labels into *.actual (correct) file
$ cut -d' ' -f1 playtennis.txt > playtennis.actual
# paste the actual vs predicted side-by-side (+ clean up trailing zeros)
$ paste playtennis.actual playtennis.predict | sed 's/\.0*$//' > playtennis.ap
# convert original (1,2) classes to binary (0,1):
$ perl -pe 's/1/0/g; s/2/1/g;' playtennis.ap > playtennis.ap01
# run perf to determine precision, recall and F-measure:
$ perf -PRE -REC -PRF -file playtennis.ap01
PRE 1.00000 pred_thresh 0.500000
REC 0.80000 pred_thresh 0.500000
PRF 0.88889 pred_thresh 0.500000
Note that, as Martin mentioned, vw uses the {-1, +1} convention for binary classification, whereas perf uses the {0, 1} convention, so you may have to translate back and forth when switching between the two.
For binary classification, I would recommend using labels +1 (play tennis) and -1 (don't play tennis) with --loss_function=logistic (although --oaa 2 and labels 1 and 2 can be used as well). VW then reports the logistic loss, which may be a more informative/useful evaluation measure than accuracy/precision/recall/f1 (depending on the application). If you want the 0/1 loss (i.e. "one minus accuracy"), add --binary.
For precision, recall, f1-score, auc and other measures, you can use the perf tool as suggested in arielf's answer.
For standard multi-class classification (one correct class for each example), use --oaa N --loss_function=logistic
and VW will report the 0/1 loss.
For multi-label multi-class classification (more correct labels per example allowed), you can use --multilabel_oaa N
(or convert each original example into N binary-classification examples).