I'm trying to classify tweets according to their sentiment into three categories (Buy, Hold, Sell). I'm using R and the package e1071.
I have two data frames: a training set and a set of new tweets whose sentiment needs to be predicted.
The trainingset data frame:
| text | sentiment |
| --- | --- |
| this stock is a good buy | Buy |
| markets crash in tokyo | Sell |
| everybody excited about new products | Hold |
Now I want to train the model using the tweet text trainingset[,2] and the sentiment category trainingset[,4]:
classifier <- naiveBayes(trainingset[,2], as.factor(trainingset[,4]), laplace = 1)
Looking into the elements of the classifier with
classifier$tables$x
I find that the conditional probabilities have been calculated: there are different probabilities for every tweet for Buy, Hold, and Sell. So far so good.
However, when I predict the training set with:
predict(classifier, trainingset[,2], type="raw")
I get a classification that is based only on the a priori probabilities, which means every tweet is classified as Hold (because "Hold" has the largest share among the sentiments). So every tweet gets the same probabilities for Buy, Hold, and Sell:
| Id | Buy | Hold | Sell |
| --- | --- | --- | --- |
| 1 | 0.25 | 0.5 | 0.25 |
| 2 | 0.25 | 0.5 | 0.25 |
| 3 | 0.25 | 0.5 | 0.25 |
| .. | .... | .... | .... |
| N | 0.25 | 0.5 | 0.25 |
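For reference, here is a self-contained toy version of what I am running (the tweets and extra columns below are just stand-ins; my real data frame has more columns, which is why I index it with [,2] and [,4]):

library(e1071)

# toy stand-in for my real data frame (column 2 = text, column 4 = sentiment)
trainingset <- data.frame(
  id        = 1:4,
  text      = c("this stock is a good buy",
                "markets crash in tokyo",
                "everybody excited about new products",
                "nothing new from the markets today"),
  source    = "twitter",
  sentiment = c("Buy", "Sell", "Hold", "Hold"),
  stringsAsFactors = FALSE
)

classifier <- naiveBayes(trainingset[,2], as.factor(trainingset[,4]), laplace = 1)

# reproduces the symptom: every row comes back with the same 0.25 / 0.5 / 0.25
predict(classifier, trainingset[,2], type = "raw")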
Any ideas what I'm doing wrong? Appreciate your help!
Thanks
It looks like you trained the model using whole sentences as inputs, while it seems that you want to use words as your input features.
Usage:
## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)

## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)

## S3 method for class 'naiveBayes'
predict(object, newdata, type = c("class", "raw"), threshold = 0.001, ...)
Arguments:
x: A numeric matrix, or a data frame of categorical and/or numeric variables.
y: Class vector.
In particular, if you train naiveBayes this way:
x <- c("john likes cake", "marry likes cats and john")
y <- as.factor(c("good", "bad"))
bayes <- naiveBayes(x, y)
you get a classifier able to recognize just these two sentences:
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = x,y = y)
A-priori probabilities:
y
bad good
0.5 0.5
Conditional probabilities:
x
y john likes cake marry likes cats and john
bad 0 1
good 1 0
To get a word-level classifier, you need to run it with words as inputs:
x <- c("john","likes","cake","marry","likes","cats","and","john")
y <- as.factor(c("good", "good", "good", "bad", "bad", "bad", "bad", "bad"))
bayes <- naiveBayes(x, y)
you get
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = x,y = y)
A-priori probabilities:
y
bad good
0.625 0.375
Conditional probabilities:
x
y and cake cats john likes marry
bad 0.2000000 0.0000000 0.2000000 0.2000000 0.2000000 0.2000000
good 0.0000000 0.3333333 0.0000000 0.3333333 0.3333333 0.0000000
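(These conditional probabilities are simply the relative word frequencies within each class: the bad class contains five word tokens, so each word observed there gets 1/5 = 0.2, while the good class contains three, so each gets 1/3 ≈ 0.333.)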
In general, R is not well suited for processing NLP data; Python (or at least Java) would be a much better choice.
To split a sentence into words, you can use the strsplit function:
unlist(strsplit("john likes cake"," "))
[1] "john" "likes" "cake"