Prepare data for text classification using Scikit Learn SVM

I'm trying to use an SVM from scikit-learn to classify tweets I collected. There are two categories; call them A and B. For now, I have all the tweets sorted into two text files, 'A.txt' and 'B.txt'. However, I'm not sure what kind of input the scikit-learn SVM expects. I have a dictionary with the labels (A and B) as its keys and, as values, a dictionary of features (unigrams) and their frequencies. Sorry, I'm really new to machine learning and not sure what I need to do to get the SVM to work. I found that the SVM takes a numpy.ndarray as its input. Do I need to create one from my own data? Should it look something like this?

Labels    features    frequency
  A        'book'        54
  B       'movies'       32

Any help is appreciated.

asked Dec 18 '12 at 22:12 by user1906856





1 Answer

Have a look at the documentation on text feature extraction.
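As a minimal sketch of what that documentation covers: each tweet becomes one row of a sparse count matrix, and the labels go in a separate list of the same length. The tweets below are made-up placeholders standing in for the lines of 'A.txt' and 'B.txt', and the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the contents of A.txt and B.txt (one tweet per line).
tweets_a = ["I love this book", "reading a good book tonight"]
tweets_b = ["great movies this week", "watching movies with friends"]

# One sample per tweet, with a parallel list of labels.
texts = tweets_a + tweets_b
labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)

# Rows are tweets, columns are unigrams, values are word counts.
# The result is a scipy sparse matrix, which scikit-learn estimators accept directly.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(X.shape)                      # (number of tweets, vocabulary size)
print(sorted(vectorizer.vocabulary_))
```

So the input is not a table of label, feature, and frequency, but one feature vector per tweet plus one label per tweet. If the per-tweet unigram counts already exist as dictionaries, sklearn.feature_extraction.DictVectorizer builds the same kind of matrix from a list of such dictionaries.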

Also have a look at the text classification example.

There is also a tutorial here:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In particular, don't focus too much on SVM models (and especially not on sklearn.svm.SVC, which is more interesting for kernel models and hence not for text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
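A rough sketch of trying those simpler models side by side, assuming the tweets are available as a list of strings with a matching list of labels. The tiny dataset here is invented for illustration, and default hyperparameters are used throughout, so treat it as a starting point rather than a benchmark.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in practice, read one tweet per line from A.txt and B.txt.
texts = [
    "I love this book",
    "reading a good book tonight",
    "great movies this week",
    "watching movies with friends",
]
labels = ["A", "A", "B", "B"]

# Candidate linear / naive Bayes models named in the answer.
models = {
    "perceptron": Perceptron(),
    "logistic_regression": LogisticRegression(),
    "bernoulli_nb": BernoulliNB(),
}

predictions = {}
for name, clf in models.items():
    # The pipeline vectorizes raw text and then fits the classifier.
    pipeline = make_pipeline(CountVectorizer(), clf)
    pipeline.fit(texts, labels)
    predictions[name] = pipeline.predict(["a new book to read"])[0]

print(predictions)
```

With real data, something like sklearn.model_selection.cross_val_score on each pipeline would give a fairer comparison than a single prediction.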

answered Oct 23 '22 at 09:10 by ogrisel
