I have a set of trainFeatures and a set of testFeatures with positive, neutral and negative labels:
trainFeats = negFeats + posFeats + neutralFeats
testFeats = negFeats + posFeats + neutralFeats
For example, one entry inside trainFeats is
(['blue', 'yellow', 'green'], 'POSITIVE')
The same holds for the list of test features, so the labels are specified for each set. My question is: how can I use the scikit-learn implementations of the Random Forest classifier and SVM to get the accuracy of each classifier, together with precision and recall scores for each class? The problem is that I am currently using words as features, while from what I have read these classifiers require numbers. Is there a way I can achieve this without changing functionality? Many thanks!
Random Forest (RF) classifiers are suitable for dealing with the high-dimensional, noisy data found in text classification. An RF model comprises a set of decision trees, each of which is trained using random subsets of features.
The linear Support Vector Machine (SVM) is widely regarded as one of the best text classification algorithms.
You can look into this scikit-learn tutorial, and especially the section on learning and predicting, for how to create and use a classifier. The example uses SVM; however, it is simple to use RandomForestClassifier instead, as all classifiers implement the fit and predict methods.
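As a minimal sketch of that shared interface, the snippet below trains a LinearSVC and a RandomForestClassifier on the same toy numeric data and calls fit and predict on each. The data, the variable names and the n_estimators and random_state values are illustrative assumptions, not taken from the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# Hypothetical toy data: two numeric features per sample, two classes.
X_train = [[0, 0], [0, 1], [5, 5], [5, 6]]
y_train = ["NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
X_new = [[0, 0.5], [5, 5.5]]

# Both estimators expose the same fit/predict methods, so they can be
# swapped without changing the surrounding code.
classifiers = (
    LinearSVC(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

predictions = {}
for clf in classifiers:
    clf.fit(X_train, y_train)
    predictions[type(clf).__name__] = clf.predict(X_new).tolist()

for name, predicted in predictions.items():
    print(name, predicted)
```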
When working with text features you can use CountVectorizer or DictVectorizer. Take a look at feature extraction and especially section 4.1.3.
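Since the entries in the question are already lists of words, one possible approach is to pass CountVectorizer a callable analyzer that returns each list unchanged, so every distinct word becomes one numeric column. This is only a sketch: apart from the first entry, which comes from the question, the data is made up, and the snake_case names are my own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# First entry is from the question; the other two are hypothetical.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["grey", "black"], "NEGATIVE"),
    (["white", "blue"], "NEUTRAL"),
]

# Separate the word lists from their labels.
train_words = [words for words, _ in train_feats]
train_labels = [label for _, label in train_feats]

# The input is already tokenized, so the analyzer just returns the list.
vectorizer = CountVectorizer(analyzer=lambda words: words)
X_train = vectorizer.fit_transform(train_words)  # sparse count matrix

vocab = sorted(vectorizer.vocabulary_)
first_row = X_train.toarray()[0].tolist()
print(vocab)
print(first_row)
```

For the test set, call vectorizer.transform (not fit_transform) so that the test rows use the same columns as the training rows.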
You can find an example for classifying text documents here.
Then you can get the precision and recall of the classifier with the classification report.
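Putting the pieces together, a rough end-to-end sketch might look like the following. It assumes the (words, label) format from the question; the tiny train and test sets, the split helper and the classifier settings are hypothetical, and the scores on such a small sample are not meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import LinearSVC

# Hypothetical data in the (words, label) format from the question.
train_feats = [
    (["blue", "yellow", "green"], "POSITIVE"),
    (["green", "bright"], "POSITIVE"),
    (["grey", "black"], "NEGATIVE"),
    (["black", "dark"], "NEGATIVE"),
    (["white", "plain"], "NEUTRAL"),
    (["plain", "beige"], "NEUTRAL"),
]
test_feats = [
    (["yellow", "bright"], "POSITIVE"),
    (["dark", "grey"], "NEGATIVE"),
    (["beige", "white"], "NEUTRAL"),
]


def split(feats):
    """Split (words, label) pairs into a list of word lists and a list of labels."""
    return [words for words, _ in feats], [label for _, label in feats]


train_words, y_train = split(train_feats)
test_words, y_test = split(test_feats)

# Fit the vocabulary on the training set only, then reuse it for the test set.
vectorizer = CountVectorizer(analyzer=lambda words: words)
X_train = vectorizer.fit_transform(train_words)
X_test = vectorizer.transform(test_words)

results = {}
for clf in (LinearSVC(), RandomForestClassifier(n_estimators=100, random_state=0)):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    # Per-class precision, recall and F1; zero_division=0 avoids warnings
    # when a class is never predicted.
    report = classification_report(y_test, y_pred, zero_division=0)
    results[type(clf).__name__] = (accuracy, report)
    print(type(clf).__name__, "accuracy:", accuracy)
    print(report)
```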