I have checked various SVM classification tools, mainly SVMlight, pysvmlight, LIBSVM, and the scikit-learn SVM classifier.
Each one takes its input/test file in a different format. For example:
pysvmlight:
[(0, [(13.0, 1.0), (14.0, 1.0), (173.0, 1.0), (174.0, 1.0)]),
(0,
[(9.0, 1.0),
(10.0, 1.0),
(11.0, 1.0),
(12.0, 1.0),
(16.0, 1.0),
(19.0, 1.0),
(20.0, 1.0),
(21.0, 1.0),
(22.0, 1.0),
(56.0, 1.0)])]
svmlight:
+1 6:0.0342598670723747 26:0.148286149621374 27:0.0570037235976456 31:0.0373086482671729 33:0.0270832794680822 63:0.0317368459004657 67:0.138424991237843 75:0.0297571881179897 96:0.0303237495966756 142:0.0241139382095992 144:0.0581948804675796 185:0.0285004985793364 199:0.0228776475252599 208:0.0366675566391316 274:0.0528930062061687 308:0.0361623318128513 337:0.0374174808347037 351:0.0347329937800643 387:0.0690970538458777 408:0.0288195477724883 423:0.0741629177979597 480:0.0719961218888683 565:0.0520577748209694 580:0.0442849093862884 593:0.329982711875242 598:0.0517245325094578 613:0.0452655621746453 641:0.0387269206869957 643:0.0398205809532254 644:0.0466353065571088 657:0.0508331832990127 717:0.0495981406619795 727:0.104798994968809 764:0.0452655621746453 827:0.0418050310923008 1027:0.05114477444793 1281:0.0633241153685135 1340:0.0657101916402099 1395:0.0522617631894159 1433:0.0471872599750513 1502:0.840963375098259 1506:0.0686138465829187 1558:0.0589627036028818 1598:0.0512079697459134 1726:0.0660884976719923 1836:0.0521934221969394 1943:0.0587388821544177 2433:0.0666767220421155 2646:0.0729483627336339 2731:0.071437898589286 2771:0.0706069752753547 3553:0.0783933439550538 3589:0.0774668403369963
http://svm.chibi.ubc.ca//sample.test.matrix.txt
corner feature_1 feature_2 feature_3 feature_4
example_11 -0.18 0.14 -0.06 0.54
example_12 0.16 -0.25 0.26 0.33
example_13 0.06 0.0 -0.2 -0.22
example_14 -0.12 -0.22 0.29 -0.01
example_15 -0.20 -0.23 -0.1 -0.71
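As a side note, scikit-learn can read the svmlight/libsvm format shown above directly. A minimal sketch, assuming the sample above is saved to a file (the filename here is hypothetical):

# Sketch: reading the svmlight/libsvm format shown above with scikit-learn.
# "sample.train.txt" is a hypothetical filename for the data pasted above.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("sample.train.txt")
print(X.shape)   # sparse feature matrix: (n_samples, n_features)
print(y[:5])     # the labels, e.g. +1 / -1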
Is there any SVM classifier which takes plain input text and gives a classification result for it?
With their ability to generalize well in high-dimensional feature spaces, SVMs eliminate the need for feature selection, making the application of text categorization considerably easier. Another advantage of SVMs over conventional methods is their robustness.
There are many different machine learning algorithms to choose from when doing text classification. One of them is the Support Vector Machine (SVM).
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression. What an SVM does is find a hyperplane that creates a boundary between two classes of data in order to classify them.
Explanation: The support vector machine is a supervised machine learning algorithm that works on both classification and regression problems. It tries to classify data by finding a hyperplane that maximizes the margin between the classes in the training data. Hence, the SVM is an example of a large-margin classifier.
My answer is twofold:
There are SVM implementations which work directly on text data, e.g., https://github.com/timshenkao/StringKernelSVM. LIBSVM is also able to handle string data: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_string_data. The key to using an SVM directly on text data is the so-called string kernel. A kernel is used within an SVM to measure the distance between data points, which here are the text documents. One example of a string kernel is the edit distance between documents; cf. http://www.jmlr.org/papers/volume2/lodhi02a/lodhi02a.pdf
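To illustrate the mechanics (this is not the kernel from the paper above), here is a minimal sketch that plugs an edit-distance-based similarity into scikit-learn's SVC via kernel='precomputed'. The documents, labels, and gamma are made up, and exp(-gamma * d) is not guaranteed to be a valid positive semi-definite kernel:

# Sketch of a string-kernel-style SVM: turn edit distance into a similarity
# with exp(-gamma * d) and hand the Gram matrix to SVC. Toy data throughout.
import numpy as np
from sklearn.svm import SVC

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def gram(A, B, gamma=0.05):
    # Kernel matrix: similarity between every pair of documents.
    return np.array([[np.exp(-gamma * edit_distance(a, b)) for b in B] for a in A])

docs = ["cheap pills buy now", "meeting at noon", "buy cheap pills", "lunch meeting today"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (made-up labels)

clf = SVC(kernel="precomputed")
clf.fit(gram(docs, docs), labels)

test = ["cheap pills now"]
print(clf.predict(gram(test, docs)))  # kernel between test and training docs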
The question is whether using a string kernel like this is a good idea for text classification.
Simplifying, the support vector machine is a function
f(x) = sgn(<w, phi(x)> + b)
What typically happens is that you take your input document, compute its bag-of-words representation, and then apply a standard kernel such as the linear kernel. So something like:
f(x) = sgn(<w, phi(bag-of-words(x))> + b)
What you most likely want is an SVM with a kernel that combines bag-of-words with a linear kernel. Implementation-wise this is easy, but it has drawbacks.
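To make that composition concrete, here is a toy sketch of the decision function with a hypothetical vocabulary and hand-picked weights, as if an SVM had already been trained (nothing here comes from a real model):

# Toy illustration of f(x) = sgn(<w, bag-of-words(x)> + b).
# The vocabulary, weights w, and bias b are hypothetical, as if already learned.
vocabulary = ["buy", "cheap", "meeting", "pills"]
w = [0.8, 0.9, -1.1, 1.2]
b = -0.5

def bag_of_words(text):
    # Count how often each vocabulary word occurs in the document.
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

def f(text):
    score = sum(wi * xi for wi, xi in zip(w, bag_of_words(text))) + b
    return 1 if score > 0 else -1

print(f("buy cheap pills"))     # 1
print(f("team meeting today"))  # -1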
Bottom line of both parts: it is not about the SVM, it is about the kernel.
Yes, you can do this in scikit-learn.
First, use CountVectorizer to convert your text documents into a document-term matrix. (This is known as the "bag of words" representation, and is one way to extract features from text.) The document-term matrix is used as your input to a Support Vector Machine, or any other classification model.
Here is a brief description of the document-term matrix, from the scikit-learn documentation:
In this scheme, features and samples are defined as follows: Each individual token occurrence frequency (normalized or not) is treated as a feature. The vector of all the token frequencies for a given document is considered a multivariate sample.
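Putting those steps together, here is a minimal sketch of that pipeline, with made-up documents and labels:

# Minimal sketch of the described pipeline; documents and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_docs = ["free money now", "project status update",
              "win money fast", "status meeting notes"]
train_labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()                  # learns the bag-of-words vocabulary
X_train = vectorizer.fit_transform(train_docs)  # sparse document-term matrix

clf = LinearSVC()
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["free money fast"])  # reuse the same vocabulary
print(clf.predict(X_test))                          # e.g. ['spam']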
However, using a Support Vector Machine (SVM) may not be the best idea in this case. From the scikit-learn documentation:
If the number of features is much greater than the number of samples, the method is likely to give poor performances.
Typically, a document-term matrix has far more features (unique terms) than samples (documents), and thus SVMs are typically not the optimal choice for this type of problem.
Here is a lesson notebook explaining and demonstrating this entire process in scikit-learn, although it uses a different classification model (Naive Bayes).
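For comparison, swapping the SVM for Naive Bayes in the sketch above only changes the classifier; a minimal version, again with toy data:

# Same features, different model: Naive Bayes instead of an SVM (toy data again).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free money now", "project status update",
        "win money fast", "status meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

nb = MultinomialNB()   # tends to cope well when features outnumber samples
nb.fit(X, labels)
print(nb.predict(vectorizer.transform(["win free money"])))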