
NLTK: Document Classification with numeric score instead of labels

Tags:

python

nltk

For a project, I've been experimenting with Python's NLTK, document classification, and the Naive Bayes classifier. As I understand from the documentation, this works very well when each document is tagged with a label such as pos or neg (or one of more than two labels).

The documents I'm working with are already rated, but instead of labels they have a score: a floating-point number between 0 and 5.

What I would like to do is build a classifier, like the movies example in the documentation, that predicts the score of a piece of text rather than a label. I believe the docs mention this as 'probabilities of numeric features', but never explore it further.

I am neither a language expert nor a statistician, so if someone has an example of this lying around, I would be most grateful if you would share it with me. Thanks!

user1765949 asked Oct 22 '12 16:10



1 Answer

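What you're looking for is regression rather than classification: predicting a continuous score from text is a regression problem. Linear regression is a natural starting point, and scikit-learn is much better suited to this than NLTK. See http://scikit-learn.org/stable/modules/linear_model.html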
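To make that concrete, here is a minimal sketch of what a text-to-score regressor might look like in scikit-learn. Everything in it is an assumption for illustration: the toy texts and scores are made up, `predict_score` is a hypothetical helper, and the choice of TF-IDF features with `Ridge` (one of the linear models described on the linked page) is just one reasonable option, not necessarily the best one for your data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: each text comes with a score between 0 and 5.
texts = [
    "terrible plot and awful acting",
    "boring and far too long",
    "average film with a few good scenes",
    "good story and solid acting",
    "excellent, a wonderful film",
]
scores = [0.5, 1.0, 2.5, 4.0, 5.0]

# Turn each text into TF-IDF features, then fit a regularized linear model.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(texts, scores)


def predict_score(text):
    """Predict a score for one text, clipped to the 0-5 range of the data."""
    raw = model.predict([text])[0]
    return float(np.clip(raw, 0.0, 5.0))


print(predict_score("wonderful story and excellent acting"))
```

A linear model can predict values outside the range it was trained on, which is why the sketch clips the result to 0-5. With real data you would want far more documents and a held-out set to measure the error, for example with mean absolute error.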

Jacob answered Oct 12 '22 23:10
