 

Naive Bayes in Spark MLlib

I have a small file 'naivebayestest.txt' with this structure:

10 1:1
20 1:2
20 1:2

From this data I'm trying to classify the vector (1). If I understand Bayes correctly, the label for (1) should be 10 (with probability 1!). Here is the program in Spark MLlib:

// jsc is an existing JavaSparkContext
String path = "/usr/local/spark/data/mllib/bayestest.txt";
JavaRDD<LabeledPoint> training = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
final NaiveBayesModel model = NaiveBayes.train(training.rdd());
Vector v = Vectors.dense(1);
double prediccion = model.predict(v);
System.out.println("Vector: " + v + " prediction: " + prediccion);

shows Vector: [1.0] prediction: 20.0

I obtain the same result with a training set of 1050 elements: 350 (one third) of the form 10 1:1 and the remaining 700 (two thirds) of the form 20 1:2. The prediction for vector (1) is still 20.0.
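For reference, what the trained model learned can be printed directly (a sketch assuming NaiveBayesModel's labels(), pi() and theta() accessors, which expose the class labels, log priors and log feature weights):

import java.util.Arrays;

// labels(): class labels; pi(): log class priors; theta(): log feature weights per class
System.out.println("labels: " + Arrays.toString(model.labels()));
System.out.println("log priors: " + Arrays.toString(model.pi()));
for (double[] row : model.theta()) {
    System.out.println("log theta: " + Arrays.toString(row));
}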

What am I doing wrong?

asked Jul 08 '16 by RafaelCaballero

2 Answers

In the source code of the Spark Naive Bayes implementation, you can find a link to the algorithms that are implemented:

  1. Multinomial NB which can handle all kinds of discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification.
  2. Bernoulli NB, which works by making every vector a 0-1 vector.

The input feature values must be nonnegative.

In your case, Spark used Multinomial NB (which is the default), so let's dig into the algorithm.
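As an aside, the model type can be chosen explicitly through the three-argument overload NaiveBayes.train(rdd, lambda, modelType); a minimal sketch (lambda 1.0 is the default smoothing value):

// the multinomial variant (the default)
final NaiveBayesModel multinomial = NaiveBayes.train(training.rdd(), 1.0, "multinomial");
// the Bernoulli variant; note it requires the feature values themselves to be 0 or 1,
// so it would reject this data set (which contains the value 2)
final NaiveBayesModel bernoulli = NaiveBayes.train(training.rdd(), 1.0, "bernoulli");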

Naive Bayes is often used for document classification, so let me restate your case as a document-classification problem:

  1. Let's say the classes are ten and twenty
  2. Let's say the input token (the only one in this case) is Spark

So your first row of data becomes the document: Spark
The second and third rows each become: Spark Spark

As I understand it from the Multinomial NB link, the algorithm can be summarized as computing a score for each class i and predicting the class with the highest score:

$$\mathrm{score}(C_i) = \ln\frac{nd_{c_i} + \lambda}{nd + nc \cdot \lambda} + \sum_{j=1}^{nf} v_j \left[ \ln\left(S_{ij} + \lambda\right) - \ln\left(S_i + nf \cdot \lambda\right) \right]$$

where:
score(Ci) : log-domain score of the test data belonging to class i (the log-posterior up to an additive constant)
nf : number of terms in the vocabulary
Sij : sum of term frequencies for class i and term j
Si : sum of term frequencies for class i
λ : lambda, the smoothing value
v : input test vector
ndci : number of rows of data in class i
nd : total number of rows of data
nc : number of classes



What happened in your case

In your data there is only one token (only one input feature), which means nf in the equation equals 1,
so: Sij = Si

That makes the per-feature multiplier of the vector vanish:

$$\ln(S_{ij} + \lambda) - \ln(S_i + nf \cdot \lambda) = \ln(S_i + \lambda) - \ln(S_i + \lambda) = 0$$

As a consequence, the score reduces to:

$$\mathrm{score}(C_i) = \ln\frac{nd_{c_i} + \lambda}{nd + nc \cdot \lambda}$$

This means the result no longer depends on the input vector at all!

Now the class with the most rows of data wins the classification.

And that's why your prediction result is 20 instead of 10.
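To make this concrete, here is the computation for the three rows in the question, assuming the default smoothing value λ = 1 (so nd = 3, nc = 2, class 10 has one row and class 20 has two):

$$\mathrm{score}(C_{10}) = \ln\frac{1 + 1}{3 + 2 \cdot 1} = \ln\frac{2}{5} \approx -0.916$$

$$\mathrm{score}(C_{20}) = \ln\frac{2 + 1}{3 + 2 \cdot 1} = \ln\frac{3}{5} \approx -0.511$$

Class 20 scores higher, so the prediction is 20.0 whatever the input. The same holds for the 1050-row training set: ln(351/1052) < ln(701/1052), so the prediction is still 20.0.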



In the end

To avoid this, try using Linear Regression, Decision Trees, Random Forests, GBTs, etc.
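For example, here is a minimal Decision Tree sketch on the same training RDD, assuming MLlib's DecisionTree.trainClassifier; note that it requires labels in {0, ..., numClasses - 1}, so the labels 10/20 are remapped to 0/1:

import java.util.HashMap;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;

// trainClassifier requires labels in {0, ..., numClasses-1}: remap 10 -> 0.0, 20 -> 1.0
JavaRDD<LabeledPoint> remapped = training.map(
        p -> new LabeledPoint(p.label() == 10.0 ? 0.0 : 1.0, p.features()));

DecisionTreeModel tree = DecisionTree.trainClassifier(
        remapped,
        2,                  // numClasses
        new HashMap<>(),    // no categorical features
        "gini",             // impurity
        5,                  // maxDepth
        32);                // maxBins

// The tree can split on the feature value, so this should print 0.0 (the remapped label 10)
System.out.println("prediction for [1.0]: " + tree.predict(Vectors.dense(1)));

Unlike the reduced Naive Bayes score above, the tree actually splits on the feature value, so even this single-feature data set is enough to separate the two classes.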

answered Nov 10 '22 by Yuan JI


The Naive Bayes model will be trained on all 3 records. Your assumption

If I understand Bayes correctly the label for (1) should be 10 (with probability 1!)

is wrong here. By definition, the correct probability is

P(10|1) = P(1|10) * P(10) / P(1)

Due to additive smoothing, however, the effective formula changes. I am not sure exactly what it becomes, but it appears that with additive smoothing the probability P(20|1) comes out greater than P(10|1), hence the result you see.
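For what it's worth, plugging the question's three rows into Laplace-smoothed estimates (a sketch assuming λ = 1, MLlib's default) suggests the priors decide the outcome: the smoothed likelihoods are equal,

$$P(1 \mid 10) = \frac{1 + 1}{1 + 1} = 1, \qquad P(1 \mid 20) = \frac{4 + 1}{4 + 1} = 1$$

while the smoothed priors are not,

$$P(10) = \frac{1 + 1}{3 + 2} = 0.4, \qquad P(20) = \frac{2 + 1}{3 + 2} = 0.6$$

so P(20|1) > P(10|1).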

The results would make more sense with lots of training data.

answered Nov 10 '22 by hard coder