Does anyone know how to set the alpha parameter when doing Naive Bayes classification?
E.g. I first used bag of words to build the feature matrix, where each cell is a word count, and then I used tf (term frequency) to normalize the matrix.
But when I used Naive Bayes to build the classifier model, I chose multinomial N.B. (which I think is correct here, rather than Bernoulli or Gaussian). The default alpha setting is 1.0 (the documentation says it is Laplace smoothing, which I know nothing about).
The results are really bad: only 21% recall on the positive (target) class. But when I set alpha = 0.0001 (picked at random), the recall reaches 95%.
Besides, I checked the multinomial N.B. formula, and I think alpha is the problem: if I use word counts as features, alpha = 1 barely affects the results; but since tf values are between 0 and 1, alpha = 1 really distorts the formula.
I also tested without tf, using only the bag-of-words counts, and the recall was 95% as well. So, does anyone know how to set the alpha value? I have to use tf as the feature matrix.
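Roughly, my setup looks like this (a minimal sketch assuming scikit-learn; the corpus and labels below are placeholders for my real data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

docs = ["some example text", "another example document"]  # placeholder corpus
labels = [1, 0]                                           # placeholder labels

# Bag of words: each cell of the matrix is a raw word count.
counts = CountVectorizer().fit_transform(docs)

# Term frequency only (no idf), so every cell ends up between 0 and 1.
tf = TfidfTransformer(use_idf=False).fit_transform(counts)

# Multinomial NB with the default Laplace smoothing, alpha = 1.0.
clf = MultinomialNB(alpha=1.0).fit(tf, labels)
```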
Thanks.
Why is alpha used?
For classifying a query point in NB, we compute P(Y=1|W) or P(Y=0|W) (considering binary classification). Here W is the vector of words, W = [w1, w2, w3, ..., wd], where d is the number of features.
So, at training time, we need to estimate the probabilities that go into the product
P(w1|Y=1) * P(w2|Y=1) * ... * P(wd|Y=1) * P(Y=1)
The same is done for Y=0.
For the Naive Bayes formula, refer to https://en.wikipedia.org/wiki/Naive_Bayes_classifier
Now at test time, suppose you encounter a word that was not present in the training set. Its estimated probability of occurring in a class is then zero, which makes the whole product zero, and that is not good.
Consider a word W* that is not present in the training set:
P(W*|Y=1) = P(W*, Y=1) / P(Y=1)
= (number of training points where W* is present and Y=1) / (number of training points where Y=1)
= 0 / (number of training points where Y=1)
= 0
To get rid of this problem we do Laplace smoothing: we add alpha to the numerator and a matching term to the denominator:
P(W*|Y=1) = (0 + alpha) / (number of training points where Y=1 + K * alpha)
where K is the number of distinct values the feature can take (2 for word present/absent).
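A toy calculation shows why alpha = 1 hurts so much with tf features but not with raw counts. This is a sketch assuming the smoothed multinomial likelihood used by scikit-learn's MultinomialNB, (N_yi + alpha) / (N_y + alpha * n_features); all the numbers below are made up for illustration:

```python
def smoothed_likelihood(feature_total, class_total, n_features, alpha):
    # Smoothed P(w_i | y) = (N_yi + alpha) / (N_y + alpha * n_features)
    return (feature_total + alpha) / (class_total + alpha * n_features)

n_features = 10_000  # vocabulary size (made up)

# Raw counts: the totals are large, so alpha = 1 barely moves the estimate.
print(smoothed_likelihood(500, 100_000, n_features, alpha=1.0))     # ~0.0046
print(smoothed_likelihood(500, 100_000, n_features, alpha=0.0001))  # ~0.0050

# tf features: the totals are tiny (each cell is between 0 and 1), so
# alpha = 1 drowns the data and pushes the estimate toward uniform.
print(smoothed_likelihood(0.5, 100.0, n_features, alpha=1.0))       # ~0.00015
print(smoothed_likelihood(0.5, 100.0, n_features, alpha=0.0001))    # ~0.0050
```

With counts, the likelihood estimate survives alpha = 1; with tf, it collapses toward the uniform value 1/n_features, which matches the recall drop described in the question.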
|Y=1) = P(W
,Y=1)/P(Y=1) ) if numerator and denominator fields are small means It is easily influenced by outlier or noise. Here also alpha helps as it moves my likelihood probabilities to uniform distribution as alpha increases. So alpha is hyper parameter and you have to tune it using techniques like grid search (as mentioned by jakevdp) or random search. (https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624)