Prepare data for text classification using Scikit Learn SVM

Tags: python, svm, scikit-learn

I'm trying to apply the SVM from scikit-learn to classify tweets I collected. There are two categories, call them A and B. For now, I have all the tweets sorted into two text files, 'A.txt' and 'B.txt'. However, I'm not sure what kind of input the scikit-learn SVM expects. I have a dictionary with the labels (A and B) as its keys and, as values, a dictionary of features (unigrams) and their frequencies. Sorry, I'm really new to machine learning and not sure what I should do to get the SVM to work. I also found that the SVM takes a numpy.ndarray as its input. Do I need to create one from my own data? Should it look something like this?

Labels    features    frequency
  A        'book'        54
  B       'movies'       32

Any help is appreciated.

asked Dec 18 '12 22:12

user1906856

1 Answer

Have a look at the documentation on text feature extraction.
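To make the expected input concrete, here is a minimal sketch of what the feature extraction step produces. It assumes one tweet per line in each file; the short lists below are made-up stand-ins for the contents of 'A.txt' and 'B.txt'. Note that the estimators want one row per tweet (not one row per label), plus a parallel list of labels:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the lines of A.txt and B.txt (one tweet per line).
# With real files you could read them with something like:
#   tweets_a = open("A.txt", encoding="utf-8").read().splitlines()
tweets_a = ["i love this book", "great book about history"]
tweets_b = ["the movies were boring", "new movies this weekend"]

# One document per tweet, and one label per document, in the same order.
docs = tweets_a + tweets_b
labels = ["A"] * len(tweets_a) + ["B"] * len(tweets_b)

# CountVectorizer builds the unigram vocabulary and counts occurrences.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: rows = tweets, columns = unigrams

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

So you normally don't build the numpy.ndarray by hand: the vectorizer returns a sparse matrix that the scikit-learn linear models accept directly.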

Also have a look at the text classification example.

There is also a tutorial here:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
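In the spirit of that tutorial, a rough sketch of an end-to-end pipeline might look like the following. The tweets are invented placeholders, and the choice of TfidfVectorizer plus LinearSVC is just one reasonable assumption, not the only way to do it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical example tweets standing in for the contents of A.txt and B.txt.
docs = [
    "i love this book",
    "great book about history",
    "the movies were boring",
    "new movies this weekend",
]
labels = ["A", "A", "B", "B"]

# The pipeline vectorizes raw strings and feeds the result to a linear SVM,
# so fit() and predict() both take plain lists of text.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

predicted = clf.predict(["a book worth reading", "going to the movies"])
print(predicted)
```

Wrapping both steps in a pipeline means new tweets are transformed with the same vocabulary that was learned during training.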

In particular, don't focus too much on SVM models (and especially not sklearn.svm.SVC, which is more interesting for kernel models and hence less suited to text classification): a simple Perceptron, LogisticRegression, or Bernoulli naive Bayes model might work just as well while being much faster to train.
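Swapping models is cheap once the features exist. As an illustrative sketch (toy data again, and the scores are training accuracy on four made-up tweets, so they say nothing about real performance), the same matrix can be fed to each of these estimators:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Hypothetical example tweets standing in for the contents of A.txt and B.txt.
docs = [
    "i love this book",
    "great book about history",
    "the movies were boring",
    "new movies this weekend",
]
labels = ["A", "A", "B", "B"]

X = TfidfVectorizer().fit_transform(docs)

# All four estimators share the same fit/score interface.
models = {
    "Perceptron": Perceptron(random_state=0),
    "LogisticRegression": LogisticRegression(),
    "BernoulliNB": BernoulliNB(),
    "LinearSVC": LinearSVC(),
}

scores = {}
for name, model in models.items():
    model.fit(X, labels)
    scores[name] = model.score(X, labels)  # training accuracy only
    print(name, scores[name])
```

On real data you would hold out a test set or use cross-validation before comparing them.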

answered Oct 23 '22 09:10

ogrisel
