NLTK: Document Classification with numeric score instead of labels

Question

For a project, I've been experimenting with Python's NLTK, document classification, and the Naive Bayes classifier. As I understand from the documentation, this works very well if your documents are tagged with a label such as pos or neg (or with more than two labels).

The documents I'm working with are already classified, but instead of labels they have a score: a floating-point number between 0 and 5.

What I would like to do is build a classifier, like the movies example in the documentation, but one that predicts the score of a piece of text rather than a label. I believe this is mentioned in the docs as 'probabilities of numeric features', but never explored further.

I am neither a language expert nor a statistician, so if someone has an example of this lying around, I would be most grateful if you would share it with me. Thanks!

Jacob · Accepted Answer

What you're looking for is linear regression, which predicts a continuous value rather than a label. scikit-learn is much better than NLTK for this; see http://scikit-learn.org/stable/modules/linear_model.html
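As a rough illustration of that suggestion, here is a minimal sketch using scikit-learn. Several things in it are assumptions and not part of the original answer: the toy texts and scores are invented, TF-IDF is just one possible way to turn text into numeric features, `Ridge` (a regularized linear regression from the same `linear_model` module) is swapped in for plain linear regression because it tends to behave better with many sparse word features, and `predict_score` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy training data, invented purely for illustration.
train_texts = [
    "a wonderful and brilliant film",
    "brilliant acting, wonderful story",
    "an average plot with decent acting",
    "decent but forgettable",
    "terrible pacing and a boring script",
    "boring, terrible, a waste of time",
]
train_scores = [4.8, 4.5, 2.5, 2.8, 0.7, 0.3]  # floats between 0 and 5

# Pipeline: raw text -> TF-IDF features -> regularized linear regression.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(train_texts, train_scores)


def predict_score(text):
    """Hypothetical helper: predict a score for one text, clipped to [0, 5]."""
    raw = model.predict([text])[0]
    # A linear model can extrapolate outside the valid range, so clip it.
    return float(np.clip(raw, 0.0, 5.0))


print(predict_score("a wonderful, moving film"))
print(predict_score("a boring and terrible mess"))
```

With real data you would want far more documents and a held-out set to measure error (for example mean absolute error) before trusting the predictions.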
