 

How do I use non-integer string labels with SVM from scikit-learn? Python

Scikit-learn has fairly user-friendly Python modules for machine learning.

I am trying to train an SVM tagger for Natural Language Processing (NLP), where my labels and input data are words and annotations. For example, in Part-Of-Speech tagging, rather than using double/integer data as input tuples like [[1,2], [2,0]], my tuples will look like this: [['word','NOUN'], ['young', 'adjective']]

Can anyone give an example of how I can use an SVM with string tuples? The tutorial/documentation given here is for integer/double inputs: http://scikit-learn.org/stable/modules/svm.html

asked Oct 18 '12 by alvas



2 Answers

Most machine learning algorithms process input samples that are vectors of floats, such that a small (often Euclidean) distance between a pair of samples means that the two samples are similar in a way that is relevant for the problem at hand.

It is the responsibility of the machine learning practitioner to find a good set of float features to encode. This encoding is domain specific, hence there is no general way to build that representation out of the raw data that would work across all application domains (various NLP tasks, computer vision, transaction log analysis...). This part of the machine learning modeling work is called feature extraction. When it involves a lot of manual work, it is often referred to as feature engineering.

Now for your specific problem: POS tags of a window of words around the word of interest in a sentence (e.g. for sequence tagging tasks such as named entity detection) can be encoded appropriately using the DictVectorizer feature extraction helper class of scikit-learn.
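For example, here is a minimal sketch of that approach (the window features, toy data, and the choice of LinearSVC are illustrative assumptions, not part of the original answer):

    # Sketch: encode string-valued window features as dicts, let
    # DictVectorizer one-hot encode them into a float matrix, then
    # train a linear SVM. Feature names and data are invented.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC

    # One dict of categorical (string-valued) features per token.
    train_features = [
        {'word': 'young', 'prev': 'the', 'next': 'cat'},
        {'word': 'cat', 'prev': 'young', 'next': 'sat'},
        {'word': 'sat', 'prev': 'cat', 'next': 'on'},
    ]
    train_labels = ['ADJ', 'NOUN', 'VERB']  # string class labels are fine for y

    # Each (feature, value) pair becomes one column of a sparse float matrix.
    vec = DictVectorizer()
    X_train = vec.fit_transform(train_features)

    clf = LinearSVC()
    clf.fit(X_train, train_labels)

    # Reuse the fitted vectorizer at prediction time (transform, not fit_transform).
    X_test = vec.transform([{'word': 'dog', 'prev': 'the', 'next': 'ran'}])
    print(clf.predict(X_test))  # e.g. ['NOUN']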

answered Nov 14 '22 by ogrisel


This is not so much a scikit-learn or Python question, but more of a general issue with SVMs.

Data instances in SVMs must be represented as vectors of scalars, typically real numbers. Categorical attributes must therefore first be mapped to numeric values before they can be included in SVMs.

Some categorical attributes lend themselves more naturally or logically to being mapped onto some scale (a loose "metric"). For example, a (1, 2, 3, 5) mapping for a Priority field with values of 'no rush', 'standard delivery', 'Urgent' and 'Most Urgent' may make sense. Another example is colors, which can be mapped to three dimensions, one each for their red, green and blue components.

Other attributes don't have semantics that allow even an approximate logical mapping onto a scale; the various values for these attributes must then be assigned an arbitrary numeric value on one (or possibly several) dimension(s) of the SVM. Understandably, if an SVM has many of these arbitrary "non-metric" dimensions, it can be less efficient at properly classifying items, because the distance computations and clustering logic implicit in the working of SVMs are less semantically related. Both strategies are sketched in the snippet below.
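A hedged sketch of these two mapping strategies (the toy values come from the examples above; the answer doesn't prescribe a specific API, and OneHotEncoder is just one way to do the one-dimension-per-value mapping):

    # Ordinal mapping vs. one-hot encoding for categorical attributes.
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Ordinal: 'priority' has a natural order, so a hand-picked scale
    # (1, 2, 3, 5) preserves it as a single numeric feature.
    priority_scale = {'no rush': 1, 'standard delivery': 2,
                      'Urgent': 3, 'Most Urgent': 5}
    priorities = ['no rush', 'Urgent', 'Most Urgent']
    ordinal = np.array([[priority_scale[p]] for p in priorities])

    # Non-metric: no meaningful order, so one-hot encoding gives each
    # value its own dimension and avoids implying spurious distances.
    colors = np.array([['red'], ['green'], ['blue'], ['red']])
    onehot = OneHotEncoder().fit_transform(colors).toarray()

    print(ordinal.ravel())  # [1 3 5]
    print(onehot)           # one column per distinct color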

This observation doesn't mean that SVMs cannot be used at all when the items include non-numeric or non-"metric" dimensions, but it is certainly a reminder that feature selection and feature mapping are very sensitive parameters of classifiers in general, and of SVMs in particular.

In the particular case of POS tagging... I'm afraid I'm stumped at the moment on which attributes of the labelled corpus to use and on how to map them to numeric values. I know that SVMTool can produce very efficient POS taggers using SVMs, and several scholarly papers also describe taggers based on SVMs. However, I'm more familiar with other approaches to tagging (e.g. HMMs or maximum entropy).

answered Nov 14 '22 by mjv