I'm trying to apply SVM from Scikit learn to classify the tweets I collected. So, there will be two categories, name them A and B. For now, I have all the tweets categorized in two text file, 'A.txt' and 'B.txt'. However, I'm not sure what type of data inputs the Scikit Learn SVM is asking for. I have a dictionary with labels (A and B) as its keys and a dictionary of features (unigrams) and their frequencies as values. Sorry, I'm really new to machine learning and not sure what I should do to get the SVM work. And I found that SVM is using numpy.ndarray as the type of its data input. Do I need to create one based on my own data? Should it be something like this?
Labels features frequency
A 'book' 54
B 'movies' 32
Any help is appreciated.
There are many different machine learning algorithms we can choose from when doing text classification with machine learning. One of those is Support Vector Machines (or SVM).
So, we use SVM to mainly classify data but we can also use it for regression. It is a fast and dependable algorithm and works well with fewer data. A very simple definition would be that SVM is a supervised algorithm that classifies or separates data using hyperplanes.
Have a look at the documentation on text feature extraction.
Also have a look at the text classification example.
There is also a tutorial here:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
In particular don't focus too much on SVM models (in particular not sklearn.svm.SVC
that is more interesting for kernel models hence not text classification): a simple Perceptron, LogisticRegression or Bernoulli naive Bayes models might work as good while being much faster to train.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With