For a project, I've been experimenting with Python NLTK, document classification, and the Naive Bayes classifier. As I understand from the documentation, this works very well if your documents are each tagged with a label such as pos or neg (or one of more than two labels).
The already-classified documents I'm working with don't have labels; instead, each has a score, a floating-point number between 0 and 5.
What I would like to do is build a classifier like the movie reviews example in the documentation, but one that predicts the score of a piece of text rather than a label. I believe this is mentioned in the docs as 'probabilities of numeric features' but never explored further.
I am neither a language expert nor a statistician, so if someone has an example of this lying around, I would be most grateful if you would share it with me. Thanks!
What you're looking for is linear regression: predicting a continuous score is a regression problem rather than a classification problem. scikit-learn is much better suited to this than NLTK; see http://scikit-learn.org/stable/modules/linear_model.html
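As a rough illustration, here is a minimal sketch of what that could look like in scikit-learn. The texts and scores are made up for the example, and the choice of TF-IDF features with `Ridge` (a regularized linear regression) is my assumption, not something taken from your data; the helper `predict_score` is a hypothetical name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Made-up training data: each text has a score between 0 and 5.
texts = [
    "a wonderful brilliant film, loved it",
    "brilliant acting and a wonderful story",
    "decent film with some slow parts",
    "average story, nothing special",
    "boring plot and terrible acting",
    "a terrible boring film, hated it",
]
scores = [4.8, 4.5, 3.0, 2.5, 1.0, 0.5]

# Turn each text into TF-IDF word features, then fit a linear model on them.
# Ridge is linear regression with L2 regularization (an assumed choice here).
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(texts, scores)

def predict_score(text):
    """Predict a score for one text, clipped to the valid 0-5 range."""
    raw = model.predict([text])[0]
    # A linear model can predict outside the range, so clip the result.
    return float(np.clip(raw, 0.0, 5.0))

good = predict_score("a wonderful and brilliant film")
bad = predict_score("a terrible and boring film")
print(good, bad)
```

With real data you would want far more documents and a held-out test set (for example via `sklearn.model_selection.train_test_split`) to check how well the predicted scores match the true ones.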