I have checked various SVM classification tools, mainly SVMlight, pysvmlight, LIBSVM, and the scikit-learn SVM classifier.
Each one takes its input/test file in a different format. For example:
pysvmlight:
[(0, [(13.0, 1.0), (14.0, 1.0), (173.0, 1.0), (174.0, 1.0)]),
(0,
[(9.0, 1.0),
(10.0, 1.0),
(11.0, 1.0),
(12.0, 1.0),
(16.0, 1.0),
(19.0, 1.0),
(20.0, 1.0),
(21.0, 1.0),
(22.0, 1.0),
(56.0, 1.0)])]
svmlight:
+1 6:0.0342598670723747 26:0.148286149621374 27:0.0570037235976456 31:0.0373086482671729 33:0.0270832794680822 63:0.0317368459004657 67:0.138424991237843 75:0.0297571881179897 96:0.0303237495966756 142:0.0241139382095992 144:0.0581948804675796 185:0.0285004985793364 199:0.0228776475252599 208:0.0366675566391316 274:0.0528930062061687 308:0.0361623318128513 337:0.0374174808347037 351:0.0347329937800643 387:0.0690970538458777 408:0.0288195477724883 423:0.0741629177979597 480:0.0719961218888683 565:0.0520577748209694 580:0.0442849093862884 593:0.329982711875242 598:0.0517245325094578 613:0.0452655621746453 641:0.0387269206869957 643:0.0398205809532254 644:0.0466353065571088 657:0.0508331832990127 717:0.0495981406619795 727:0.104798994968809 764:0.0452655621746453 827:0.0418050310923008 1027:0.05114477444793 1281:0.0633241153685135 1340:0.0657101916402099 1395:0.0522617631894159 1433:0.0471872599750513 1502:0.840963375098259 1506:0.0686138465829187 1558:0.0589627036028818 1598:0.0512079697459134 1726:0.0660884976719923 1836:0.0521934221969394 1943:0.0587388821544177 2433:0.0666767220421155 2646:0.0729483627336339 2731:0.071437898589286 2771:0.0706069752753547 3553:0.0783933439550538 3589:0.0774668403369963
http://svm.chibi.ubc.ca//sample.test.matrix.txt
corner feature_1 feature_2 feature_3 feature_4
example_11 -0.18 0.14 -0.06 0.54
example_12 0.16 -0.25 0.26 0.33
example_13 0.06 0.0 -0.2 -0.22
example_14 -0.12 -0.22 0.29 -0.01
example_15 -0.20 -0.23 -0.1 -0.71
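As a side note, scikit-learn can read the svmlight/libsvm format shown above directly. A minimal sketch, assuming the sample above is saved to a file (the filename here is hypothetical):

# Sketch: reading the svmlight/libsvm format shown above with scikit-learn.
# "sample.train.txt" is a hypothetical filename for the data pasted above.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("sample.train.txt")
print(X.shape)   # sparse feature matrix: (n_samples, n_features)
print(y[:5])     # the labels, e.g. +1 / -1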
Is there any SVM classifier which takes plain input text and gives a classification result for it?
With their ability to generalize well in high-dimensional feature spaces, SVMs eliminate the need for feature selection, making the application of text categorization considerably easier. Another advantage of SVMs over conventional methods is their robustness.
There are many different machine learning algorithms to choose from when doing text classification. One of them is the Support Vector Machine (SVM).
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression. What an SVM does is find a hyperplane that creates a boundary between two classes of data in order to classify them.
Explanation: The support vector machine is a supervised machine learning algorithm that works on both classification and regression problems. It tries to classify data by finding a hyperplane that maximizes the margin between the classes in the training data. Hence, the SVM is an example of a large-margin classifier.
My answer is twofold:
There are SVM implementations which work directly on text data, e.g., https://github.com/timshenkao/StringKernelSVM. LIBSVM is also able to handle string data: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_string_data. The key to using an SVM directly on text data is the so-called string kernel. A kernel is used within an SVM to measure the distance between data points, which here are the text documents. One example of a string kernel is the edit distance between documents; cf. http://www.jmlr.org/papers/volume2/lodhi02a/lodhi02a.pdf
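To illustrate the mechanics (this is not the kernel from the paper above), here is a minimal sketch that plugs an edit-distance-based similarity into scikit-learn's SVC via kernel='precomputed'. The documents, labels, and gamma are made up, and exp(-gamma * d) is not guaranteed to be a valid positive semi-definite kernel:

# Sketch of a string-kernel-style SVM: turn edit distance into a similarity
# with exp(-gamma * d) and hand the Gram matrix to SVC. Toy data throughout.
import numpy as np
from sklearn.svm import SVC

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def gram(A, B, gamma=0.05):
    # Kernel matrix: similarity between every pair of documents.
    return np.array([[np.exp(-gamma * edit_distance(a, b)) for b in B] for a in A])

docs = ["cheap pills buy now", "meeting at noon", "buy cheap pills", "lunch meeting today"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (made-up labels)

clf = SVC(kernel="precomputed")
clf.fit(gram(docs, docs), labels)

test = ["cheap pills now"]
print(clf.predict(gram(test, docs)))  # kernel between test and training docs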
The question is whether using a string kernel like this is a good idea for text classification.
Simplifying, the support vector machine is a function
f(x) = sgn(<w, phi(x)> + b)
What typically happens is that you take your input document, compute its bag-of-words representation, and then apply a standard kernel such as the linear kernel. So something like:
f(x) = sgn(<w, phi(bag-of-words(x))> + b)
What you most likely want is an SVM with a kernel that combines bag-of-words with a linear kernel. Implementation-wise this is easy, but it has drawbacks.
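To make that composition concrete, here is a toy sketch of the decision function with a hypothetical vocabulary and hand-picked weights, as if an SVM had already been trained (nothing here comes from a real model):

# Toy illustration of f(x) = sgn(<w, bag-of-words(x)> + b).
# The vocabulary, weights w, and bias b are hypothetical, as if already learned.
vocabulary = ["buy", "cheap", "meeting", "pills"]
w = [0.8, 0.9, -1.1, 1.2]
b = -0.5

def bag_of_words(text):
    # Count how often each vocabulary word occurs in the document.
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

def f(text):
    score = sum(wi * xi for wi, xi in zip(w, bag_of_words(text))) + b
    return 1 if score > 0 else -1

print(f("buy cheap pills"))     # 1
print(f("team meeting today"))  # -1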
Bottom line of both parts: it is not about the SVM, it is about the kernel.
Yes, you can do this in scikit-learn.
First, use CountVectorizer to convert your text documents into a document-term matrix. (This is known as the "bag of words" representation, and is one way to extract features from text.) The document-term matrix is used as your input to a Support Vector Machine, or any other classification model.
Here is a brief description of the document-term matrix, from the scikit-learn documentation:
In this scheme, features and samples are defined as follows: Each individual token occurrence frequency (normalized or not) is treated as a feature. The vector of all the token frequencies for a given document is considered a multivariate sample.
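Putting those steps together, here is a minimal sketch of that pipeline, with made-up documents and labels:

# Minimal sketch of the described pipeline; documents and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs = ["free money now", "project status update",
              "win money fast", "status meeting notes"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()                  # learns the bag-of-words vocabulary
X_train = vectorizer.fit_transform(train_docs)  # sparse document-term matrix

clf = LinearSVC()
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["free money fast"])  # reuse the same vocabulary
print(clf.predict(X_test))                          # e.g. ['spam']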
However, using a Support Vector Machine (SVM) may not be the best idea in this case. From the scikit-learn documentation:
If the number of features is much greater than the number of samples, the method is likely to give poor performances.
Typically, a document-term matrix has far more features (unique terms) than samples (documents), and thus SVMs are typically not the optimal choice for this type of problem.
Here is a lesson notebook explaining and demonstrating this entire process in scikit-learn, although it uses a different classification model (Naive Bayes).
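For comparison, swapping the SVM for Naive Bayes in the sketch above only changes the classifier; a minimal version, again with toy data:

# Same features, different model: Naive Bayes instead of an SVM (toy data again).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free money now", "project status update",
        "win money fast", "status meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

nb = MultinomialNB()   # tends to cope well when features outnumber samples
nb.fit(X, labels)
print(nb.predict(vectorizer.transform(["win free money"])))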