I have a small file 'naivebayestest.txt' (LIBSVM format) with this structure:
10 1:1
20 1:2
20 1:2
From this data I'm trying to classify the vector (1). If I understand Bayes correctly, the label for (1) should be 10 (with probability 1!). My program in Spark MLlib:
// load the training data in LIBSVM format
String path = "/usr/local/spark/data/mllib/bayestest.txt";
JavaRDD<LabeledPoint> training = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
// train a Naive Bayes model (multinomial, the default)
final NaiveBayesModel model = NaiveBayes.train(training.rdd());
// classify the single-feature vector [1.0]
Vector v = Vectors.dense(1);
double prediccion = model.predict(v);
System.out.println("Vector: "+v+" prediction: "+prediccion);
shows Vector: [1.0] prediction: 20.0
I obtain the same result with a training set of 1050 elements: 350 (1/3) of the form 10 1:1 and the rest (2/3) of the form 20 1:2. I still get the prediction 20.0 for vector (1).
What am I doing wrong?
In the source code of the Spark Naive Bayes implementation you can find links to the algorithms that are implemented:
- Multinomial NB which can handle all kinds of discrete data. For example, by converting documents into TF-IDF vectors, it can be used for document classification.
- Bernoulli NB by making every vector a 0-1 vector.
The input feature values must be nonnegative.
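For reference, the variant and the smoothing value can be chosen explicitly with the three-argument overload of NaiveBayes.train (the λ = 1.0 shown here is just the default value):

// explicitly multinomial, with smoothing lambda = 1.0 (the default)
NaiveBayesModel multinomial = NaiveBayes.train(training.rdd(), 1.0, "multinomial");
// Bernoulli variant; requires every feature value to be 0 or 1
NaiveBayesModel bernoulli = NaiveBayes.train(training.rdd(), 1.0, "bernoulli");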
In your case, Spark used Multinomial NB (which is the default), so let's dig into the algorithm.
Naive Bayes is often used for document classification, so let me restate your case as a document classification problem:
- the classes are ten and twenty
- the only term (token) in the vocabulary is Spark
So your first row of data becomes the document: Spark, with class ten.
The second and third each become: Spark Spark, with class twenty.
From what I understood from the Multinomial NB link, the algorithm can be summarized by this equation:

result = argmax over class i of [ ln P(Ci) + Σ_{j=1..nf} vj * ( ln(Sij+λ) - ln(Si+nf*λ) ) ]

where:
P(Ci) : smoothed prior probability of class i, estimated as (ndci+λ)/(nd+nc*λ)
nf : number of terms in the vocabulary (number of features)
Sij : sum of term frequencies for class i and term j
Si : sum of term frequencies for class i
λ : lambda, the smoothing value
v : input test vector
ndci : number of rows of data in class i
nd : total number of rows of data
nc : number of classes
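To make this concrete, here is a small stand-alone computation of the two class scores for your file, plugging your counts into the equation above with Spark's default λ = 1 (the class and variable names are mine, for illustration only):

public class NaiveBayesHandCheck {
    public static void main(String[] args) {
        double lambda = 1.0;        // Spark's default smoothing value
        int nf = 1;                 // one term in the vocabulary
        int nd = 3;                 // total number of rows
        int nc = 2;                 // two classes: 10 and 20

        double[] ndci = {1, 2};     // rows per class: one "10", two "20"
        double[] si   = {1, 4};     // term-frequency sums: 1 for class 10, 2+2 for class 20
        double v = 1.0;             // the test vector [1.0]

        String[] labels = {"10", "20"};
        for (int i = 0; i < nc; i++) {
            double prior  = Math.log((ndci[i] + lambda) / (nd + nc * lambda));
            // with nf = 1 we have Sij = Si, so this weight is always 0
            double weight = Math.log(si[i] + lambda) - Math.log(si[i] + nf * lambda);
            System.out.println("class " + labels[i] + " score = " + (prior + v * weight));
        }
        // prints: class 10 score ≈ -0.916, class 20 score ≈ -0.511  ->  20 wins
    }
}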
What happened in your case
In your data there is only one token (i.e. only one input feature), which means nf in the equation equals 1,
so: Sij = Si
That makes the per-term multiplier of the input vector: ln(Sij+λ) - ln(Si+nf*λ) = ln(Si+λ) - ln(Si+λ) = 0
As a consequence, the equation reduces to:

result = argmax over class i of ln P(Ci) = argmax over class i of ln((ndci+λ)/(nd+nc*λ))

which means the result no longer depends on the input vector at all!
Now the class with the most rows of data wins the classification.
And that's why your prediction result is 20 instead of 10.
In the end
To avoid this, try using Linear Regression, Decision Trees, Random Forests, GBT, etc., as in the sketch below.
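For instance, a decision tree on the same data might look like this (a sketch, not tested; 'training' is the JavaRDD from the question, and the labels 10/20 are remapped because MLlib tree classifiers expect labels in {0, ..., numClasses-1}):

import java.util.HashMap;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.DecisionTree;
import org.apache.spark.mllib.tree.model.DecisionTreeModel;

// remap labels 10 -> 0.0 and 20 -> 1.0
JavaRDD<LabeledPoint> relabeled = training.map(p ->
        new LabeledPoint(p.label() == 10.0 ? 0.0 : 1.0, p.features()));

// a shallow tree splits directly on the feature value, so it can
// separate 1:1 from 1:2 even with a single feature
DecisionTreeModel tree = DecisionTree.trainClassifier(
        relabeled, 2, new HashMap<Integer, Integer>(), "gini", 5, 32);

System.out.println(tree.predict(Vectors.dense(1))); // expected: 0.0, i.e. the original label 10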
The Naive Bayes model will be trained on all 3 records. Your assumption

If I understand Bayes correctly the label for (1) should be 10 (with probability 1!)

is wrong here. The correct probability is

P(10|1) = P(1|10) * P(10) / P(1)
This holds by definition, but due to additive smoothing the estimated terms change. I am not sure of the exact resulting formula, but it turns out that with additive smoothing the probability P(20|1) comes out greater than P(10|1), hence the result you got.
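For intuition, here is the arithmetic with and without smoothing, using the counts from this data set (λ = 1 is Spark's default; the smoothed priors follow the formula in the first answer):

Without smoothing: P(1|10) = 1, P(10) = 1/3, P(1) = 1/3, so P(10|1) = 1 * (1/3) / (1/3) = 1, matching your intuition.

With additive smoothing (λ = 1) the priors become P(10) = (1+1)/(3+2) = 2/5 and P(20) = (2+1)/(3+2) = 3/5, while the single-feature likelihoods come out equal for both classes, so P(20|1) > P(10|1) and 20 wins.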
And it would make more sense with lots of training data.