 

Multinomial Naive Bayes parameter alpha setting? scikit-learn

Does anyone know how to set the alpha parameter when doing Naive Bayes classification?

E.g. I first used bag of words to build the feature matrix, where each cell is a word count, and then I used tf (term frequency) to normalize the matrix.

But when I used Naive Bayes to build the classifier model, I chose multinomial NB (which I think is correct, rather than Bernoulli or Gaussian). The default alpha setting is 1.0 (the documentation says it is Laplace smoothing, which I don't understand).

The result is really bad: only 21% recall on the positive (target) class. But when I set alpha = 0.0001 (picked at random), the recall goes up to 95%.

Besides, I checked the multinomial NB formula, and I think the problem is alpha: if I use word counts as features, alpha = 1 barely affects the results, but since the tf values are between 0 and 1, alpha = 1 really does affect the output of the formula.

I also tested the results without tf, using only bag-of-words counts, and recall is 95% as well. So does anyone know how to set the alpha value? I have to use tf as the feature matrix.
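
Roughly, my setup looks like the sketch below (simplified toy documents and labels, not my real corpus), just to show where alpha comes in:

    # Minimal sketch: tf-normalized bag of words + MultinomialNB with different alphas.
    # The documents and labels below are hypothetical placeholders.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["good movie", "great film", "bad movie", "awful film",
            "great acting", "bad acting"]   # hypothetical documents
    labels = [1, 1, 0, 0, 1, 0]             # hypothetical labels

    counts = CountVectorizer().fit_transform(docs)              # word counts
    tf = TfidfTransformer(use_idf=False).fit_transform(counts)  # tf normalization (values in 0-1)

    for alpha in (1.0, 0.1, 0.01, 0.001, 0.0001):
        clf = MultinomialNB(alpha=alpha).fit(tf, labels)
        print(alpha, clf.score(tf, labels))  # training accuracy, just to see how alpha changes the fit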

Thanks.

HAO CHEN asked Nov 20 '15



1 Answer

Why is alpha used?

To classify a query point in NB, we need P(Y=1|W) or P(Y=0|W) (considering binary classification), where W is the vector of words W = [w1, w2, w3, ..., wd] and d is the number of features.

So, at training time, we need the probability of all of these:
P(w1|Y=1) * P(w2|Y=1) * ... * P(wd|Y=1) * P(Y=1)

The same should be done for Y=0.

For the Naive Bayes formula, refer to https://en.wikipedia.org/wiki/Naive_Bayes_classifier.

Now at testing time, suppose you encounter a word that is not present in the training set. Its estimated probability within a class is zero, which makes the whole product zero, which is not good.

Consider a word W* that is not present in the training set:

P(W*|Y=1) = P(W*, Y=1) / P(Y=1)
          = (number of training points where W* is present and Y=1) / (number of training points where Y=1)
          = 0 / (number of training points where Y=1)

To get rid of this problem we do Laplace smoothing: we add alpha to the numerator and to the denominator.

     = (0 + alpha) / (number of training points where Y=1 + number of class labels * alpha)

  1. In the real world some words occur very few times and some many times. Or, to think of it another way, in the formula above (P(W|Y=1) = P(W,Y=1)/P(Y=1)), if the numerator and denominator are small, the estimate is easily influenced by outliers or noise. Here too alpha helps, because as alpha increases it moves the likelihood probabilities toward a uniform distribution (see the numeric sketch below).
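
To see this numerically, here is a small sketch of the smoothed estimate from the formula above, with made-up counts (an unseen word, 100 training points with Y=1, 2 class labels):

    # Laplace-smoothed estimate for an unseen word, using made-up numbers.
    count_w_and_y1 = 0   # the word never occurs with Y=1
    count_y1 = 100       # training points where Y=1
    num_labels = 2       # number of class labels, as in the formula above

    for alpha in (0.0001, 1.0, 10.0, 1000.0):
        p = (count_w_and_y1 + alpha) / (count_y1 + num_labels * alpha)
        print(alpha, p)  # moves from ~0 toward 1/num_labels (uniform) as alpha grows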

So alpha is a hyperparameter, and you have to tune it using techniques like grid search (as mentioned by jakevdp) or random search (https://towardsdatascience.com/hyperparameter-tuning-c5619e7e6624).
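
For example, with scikit-learn a grid search over alpha might look like the sketch below; the toy documents, labels, and candidate alpha values are only illustrative:

    # Grid-search sketch over alpha with MultinomialNB; the data below is a toy placeholder.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB

    docs = ["good movie", "great film", "bad movie", "awful film",
            "great acting", "bad acting"]   # hypothetical documents
    labels = [1, 1, 0, 0, 1, 0]             # hypothetical labels
    tf = TfidfTransformer(use_idf=False).fit_transform(CountVectorizer().fit_transform(docs))

    param_grid = {"alpha": [1.0, 0.1, 0.01, 0.001, 0.0001]}  # candidate alphas to try
    search = GridSearchCV(MultinomialNB(), param_grid, scoring="recall", cv=2)
    search.fit(tf, labels)
    print(search.best_params_, search.best_score_)  # best alpha by cross-validated recall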

Gopu_Tunas answered Nov 18 '22
