Dealing with negative values in sklearn MultinomialNB

I am normalizing my text input before running MultinomialNB in sklearn like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Normalizer

vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english', use_idf=True)
lsa = TruncatedSVD(n_components=100)
mnb = MultinomialNB(alpha=0.01)

train_text = vectorizer.fit_transform(raw_text_train)
train_text = lsa.fit_transform(train_text)  # SVD components can be negative
train_text = Normalizer(copy=False).fit_transform(train_text)

mnb.fit(train_text, train_labels)

Unfortunately, MultinomialNB does not accept the negative values created during the LSA stage. Any ideas for getting around this?

asked Jun 11 '14 by seanlorenz

People also ask

How does machine learning deal with negative values?

A common technique for handling negative values is to add a constant value to the data prior to applying the log transform. The transformation is therefore log(Y+a) where a is the constant. Some people like to choose a so that min(Y+a) is a very small positive number (like 0.001). Others choose a so that min(Y+a) = 1.
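For example, a minimal sketch of that shifted transform (the data array here is hypothetical):

import numpy as np

Y = np.array([-3.2, -0.5, 0.0, 1.7, 4.1])  # hypothetical data containing negatives

a = 1.0 - Y.min()      # choose the constant a so that min(Y + a) == 1
Y_log = np.log(Y + a)  # log(Y + a) is now defined for every element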

What is multinomial naive Bayes classifier?

The Multinomial Naive Bayes algorithm is a Bayesian learning approach popular in Natural Language Processing (NLP). The program guesses the tag of a text, such as an email or a newspaper story, using the Bayes theorem. It calculates each tag's likelihood for a given sample and outputs the tag with the greatest chance.
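As a rough illustration of that count-based workflow (the toy corpus and tags below are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached", "buy cheap now"]
tags = ["spam", "ham", "spam"]  # hypothetical tags

vec = CountVectorizer()
counts = vec.fit_transform(docs)         # non-negative term counts
clf = MultinomialNB().fit(counts, tags)  # estimates each tag's likelihood

print(clf.predict(vec.transform(["buy pills now"])))  # most likely ['spam']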


1 Answer

I recommend that you don't use Naive Bayes with SVD or other matrix factorizations, because Naive Bayes is based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Use another classifier instead, for example RandomForest.

I tried this experiment, with these results:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Normalizer

vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english', use_idf=True)
lsa = NMF(n_components=100)
mnb = MultinomialNB(alpha=0.01)

train_text = vectorizer.fit_transform(raw_text_train)
train_text = lsa.fit_transform(train_text)  # NMF components stay non-negative
train_text = Normalizer(copy=False).fit_transform(train_text)

mnb.fit(train_text, train_labels)

This is the same pipeline, but using NMF (non-negative matrix factorization) instead of SVD, and it got 0.04% accuracy.

Changing the classifier from MultinomialNB to RandomForest, I got 79% accuracy.
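For reference, the swap itself is only a small change on top of the original TruncatedSVD pipeline; a minimal sketch, assuming the same train_text and train_labels as in the question (n_estimators=100 is an arbitrary choice, not from the answer):

from sklearn.ensemble import RandomForestClassifier

# RandomForest has no non-negativity requirement, so the negative
# values produced by TruncatedSVD are not a problem here.
rf = RandomForestClassifier(n_estimators=100)
rf.fit(train_text, train_labels)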

Therefore, either change the classifier or don't apply a matrix factorization.

answered Sep 20 '22 by Martin Forte