Scikit learn - How to use SVM and Random Forest for text classification?

I have a set of trainFeatures and a set of testFeatures with positive, neutral and negative labels:

trainFeats = negFeats + posFeats + neutralFeats
testFeats  = negFeats + posFeats + neutralFeats

For example, one entry inside the trainFeats is

(['blue', 'yellow', 'green'], 'POSITIVE') 

and the same holds for the list of test features, so the labels are specified for each set. My question is: how can I use the scikit-learn implementations of the Random Forest classifier and SVM to get the accuracy of the classifier, along with precision and recall scores for each class? The problem is that I am currently using words as features, while from what I have read these classifiers require numeric input. Is there a way I can achieve this without changing the functionality? Many thanks!

asked Feb 23 '14 by Crista23

People also ask

Can we use random forest for text classification?

Random Forest (RF) classifiers are well suited to dealing with the high-dimensional, noisy data found in text classification. An RF model comprises a set of decision trees, each of which is trained using a random subset of the features.

Which machine learning algorithm is best for text classification?

Linear Support Vector Machine is widely regarded as one of the best text classification algorithms.


1 Answer

You can look at this scikit-learn tutorial, especially the section on learning and predicting, to see how to create and use a classifier. The example uses an SVM, but it is simple to use RandomForestClassifier instead, since all scikit-learn classifiers implement the fit and predict methods.
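As a rough sketch of that shared interface (the tiny numeric arrays here are placeholders; in practice X_train and X_test would come from a text vectorizer, as described below):

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Placeholder numeric data, only to show the shared fit/predict interface;
# real X_train/X_test would be produced by a text vectorizer (see below).
X_train = [[1, 0, 2], [0, 3, 0], [2, 1, 1]]
y_train = ['POSITIVE', 'NEGATIVE', 'NEUTRAL']
X_test = [[1, 1, 0]]

svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)
print(svm_clf.predict(X_test))

rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train)
print(rf_clf.predict(X_test))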

When working with text features you can use CountVectorizer or DictVectorizer. Take a look at the feature extraction documentation, especially section 4.1.3.
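For word lists in the same format as the question's trainFeats, one option (a sketch with made-up data; the variable names are my own) is to pass a callable as CountVectorizer's analyzer so it treats each document as already tokenized:

from sklearn.feature_extraction.text import CountVectorizer

# Toy (word list, label) pairs in the same shape as the question's data.
train_feats = [(['blue', 'yellow', 'green'], 'POSITIVE'),
               (['red', 'black'], 'NEGATIVE'),
               (['grey', 'white'], 'NEUTRAL')]
test_feats = [(['blue', 'green'], 'POSITIVE')]

train_tokens, y_train = zip(*train_feats)
test_tokens, y_test = zip(*test_feats)

# The callable analyzer just returns the word list, so CountVectorizer
# counts the given words instead of tokenizing raw strings.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X_train = vectorizer.fit_transform(train_tokens)
X_test = vectorizer.transform(test_tokens)

The resulting X_train and X_test are numeric (sparse count) matrices that can be fed directly to the classifiers above.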

You can find an example for classifying text documents here.

Then you can get the overall accuracy and the per-class precision and recall of the classifier with the classification report (classification_report in sklearn.metrics).
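For example (the labels and predictions below are made up; in practice they would be y_test and the classifier's predict() output):

from sklearn.metrics import accuracy_score, classification_report

# Placeholder true labels and predictions for illustration.
y_true = ['POSITIVE', 'NEGATIVE', 'NEUTRAL', 'POSITIVE']
y_pred = ['POSITIVE', 'NEGATIVE', 'POSITIVE', 'POSITIVE']

print(accuracy_score(y_true, y_pred))         # overall accuracy
print(classification_report(y_true, y_pred))  # per-class precision, recall, F1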

answered Sep 19 '22 by dnll