 

Naive Bayes in Spark MLlib

I have a small file 'naivebayestest.txt' with this structure:

10 1:1
20 1:2
20 1:2

From this data I'm trying to classify the vector (1). If I understand Bayes correctly, the label for (1) should be 10 (with probability 1!). Here is the program in Spark MLlib:

// jsc is an existing JavaSparkContext
String path = "/usr/local/spark/data/mllib/bayestest.txt";
JavaRDD<LabeledPoint> training = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
final NaiveBayesModel model = NaiveBayes.train(training.rdd());
Vector v = Vectors.dense(1);
double prediccion = model.predict(v);
System.out.println("Vector: " + v + " prediction: " + prediccion);

shows Vector: [1.0] prediction: 20.0

I obtain the same result with a training set of 1050 elements: 350 (one third) of the form 10 1:1 and the remaining 700 (two thirds) of the form 20 1:2. The prediction for vector (1) is still 20.0.
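For reference, what the trained model learned can be printed directly (a sketch assuming NaiveBayesModel's labels(), pi() and theta() accessors, which expose the class labels, log priors and log feature weights):

import java.util.Arrays;

// labels(): class labels; pi(): log class priors; theta(): log feature weights per class
System.out.println("labels: " + Arrays.toString(model.labels()));
System.out.println("log priors: " + Arrays.toString(model.pi()));
for (double[] row : model.theta()) {
    System.out.println("log theta: " + Arrays.toString(row));
}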

What am I doing wrong?

asked Jul 08 '16 by RafaelCaballero

2 Answers

In the source code of the Spark Naive Bayes implementation, you can find a link to the algorithms that are implemented:

  1. Multinomial NB which can handle all kinds of discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification.
  2. Bernoulli NB, which works by making every vector a 0-1 vector.

The input feature values must be nonnegative.

In your case, Spark used Multinomial NB (which is the default), so let's dig into the algorithm.
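As an aside, the model type can be chosen explicitly through the three-argument overload NaiveBayes.train(rdd, lambda, modelType); a minimal sketch (lambda 1.0 is the default smoothing value):

// the multinomial variant (the default)
final NaiveBayesModel multinomial = NaiveBayes.train(training.rdd(), 1.0, "multinomial");
// the Bernoulli variant; note it requires the feature values themselves to be 0 or 1,
// so it would reject this data set (which contains the value 2)
final NaiveBayesModel bernoulli = NaiveBayes.train(training.rdd(), 1.0, "bernoulli");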

Naive Bayes is often used for document classification, so let me restate your case as a document-classification problem:

  1. Let's say the classes are ten and twenty
  2. Let's say the input token (the only one in this case) is Spark

So your first row of data becomes the document: Spark
The second and third rows each become: Spark Spark

As I understand it from the Multinomial NB link, the algorithm can be summarized as computing a score for each class i and predicting the class with the highest score:

$$\mathrm{score}(C_i) = \ln\frac{nd_{c_i} + \lambda}{nd + nc \cdot \lambda} + \sum_{j=1}^{nf} v_j \left[ \ln\left(S_{ij} + \lambda\right) - \ln\left(S_i + nf \cdot \lambda\right) \right]$$

where:
score(Ci) : log-domain score of the test data belonging to class i (the log-posterior up to an additive constant)
nf : number of terms in the vocabulary
Sij : sum of term frequencies for class i and term j
Si : sum of term frequencies for class i
λ : lambda, the smoothing value
v : input test vector
ndci : number of rows of data in class i
nd : total number of rows of data
nc : number of classes



What happened in your case

In your data there is only one token (only one input feature), which means nf in the equation equals 1,
so: Sij = Si

That makes the per-feature multiplier of the vector vanish:

$$\ln(S_{ij} + \lambda) - \ln(S_i + nf \cdot \lambda) = \ln(S_i + \lambda) - \ln(S_i + \lambda) = 0$$

As a consequence, the score reduces to:

$$\mathrm{score}(C_i) = \ln\frac{nd_{c_i} + \lambda}{nd + nc \cdot \lambda}$$

This means the result no longer depends on the input vector at all!

Now the class with the most rows of data wins the classification.

And that's why your prediction result is 20 instead of 10.
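To make this concrete, here is the computation for the three rows in the question, assuming the default smoothing value λ = 1 (so nd = 3, nc = 2, class 10 has one row and class 20 has two):

$$\mathrm{score}(C_{10}) = \ln\frac{1 + 1}{3 + 2 \cdot 1} = \ln\frac{2}{5} \approx -0.916$$

$$\mathrm{score}(C_{20}) = \ln\frac{2 + 1}{3 + 2 \cdot 1} = \ln\frac{3}{5} \approx -0.511$$

Class 20 scores higher, so the prediction is 20.0 whatever the input. The same holds for the 1050-row training set: ln(351/1052) < ln(701/1052), so the prediction is still 20.0.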



In the end

To avoid this, try using Linear Regression, Decision Trees, Random Forests, GBTs, etc.
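For example, here is a minimal Decision Tree sketch on the same training RDD, assuming MLlib's DecisionTree.trainClassifier; note that it requires labels in {0, ..., numClasses - 1}, so the labels 10/20 are remapped to 0/1:

import java.util.HashMap;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;

// trainClassifier requires labels in {0, ..., numClasses-1}: remap 10 -> 0.0, 20 -> 1.0
JavaRDD<LabeledPoint> remapped = training.map(
        p -> new LabeledPoint(p.label() == 10.0 ? 0.0 : 1.0, p.features()));

DecisionTreeModel tree = DecisionTree.trainClassifier(
        remapped,
        2,                  // numClasses
        new HashMap<>(),    // no categorical features
        "gini",             // impurity
        5,                  // maxDepth
        32);                // maxBins

// The tree can split on the feature value, so this should print 0.0 (the remapped label 10)
System.out.println("prediction for [1.0]: " + tree.predict(Vectors.dense(1)));

Unlike the reduced Naive Bayes score above, the tree actually splits on the feature value, so even this single-feature data set is enough to separate the two classes.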

answered Nov 10 '22 by Yuan JI


The Naive Bayes model will be trained on all 3 records. Your assumption

If I understand Bayes correctly the label for (1) should be 10 (with probability 1!)

is wrong here. By definition, the correct probability is

P(10|1) = P(1|10) * P(10) / P(1)

Due to additive smoothing, however, the effective formula changes. I am not sure exactly what it becomes, but it appears that with additive smoothing the probability P(20|1) comes out greater than P(10|1), hence the result you see.
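For what it's worth, plugging the question's three rows into Laplace-smoothed estimates (a sketch assuming λ = 1, MLlib's default) suggests the priors decide the outcome: the smoothed likelihoods are equal,

$$P(1 \mid 10) = \frac{1 + 1}{1 + 1} = 1, \qquad P(1 \mid 20) = \frac{4 + 1}{4 + 1} = 1$$

while the smoothed priors are not,

$$P(10) = \frac{1 + 1}{3 + 2} = 0.4, \qquad P(20) = \frac{2 + 1}{3 + 2} = 0.6$$

so P(20|1) > P(10|1).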

The results would make more sense with lots of training data.

answered Nov 10 '22 by hard coder