 

How do I use non-integer string labels with SVM from scikit-learn? Python

Scikit-learn has fairly user-friendly Python modules for machine learning.

I am trying to train an SVM tagger for Natural Language Processing (NLP), where my labels and input data are words and annotations. For example, in Part-Of-Speech tagging, rather than using double/integer data as input tuples like [[1,2], [2,0]], my tuples will look like this: [['word','NOUN'], ['young', 'adjective']]

Can anyone give an example of how I can use an SVM with string tuples? The tutorial/documentation given here is for integer/double inputs: http://scikit-learn.org/stable/modules/svm.html

asked Oct 18 '12 by alvas



2 Answers

Most machine learning algorithms process input samples that are vectors of floats, such that a small (often Euclidean) distance between a pair of samples means that the two samples are similar in a way that is relevant for the problem at hand.

It is the responsibility of the machine learning practitioner to find a good set of float features to encode. This encoding is domain specific, hence there is no general way to build that representation out of the raw data that would work across all application domains (various NLP tasks, computer vision, transaction log analysis...). This part of the machine learning modeling work is called feature extraction. When it involves a lot of manual work, it is often referred to as feature engineering.

Now for your specific problem: POS tags of a window of words around the word of interest in a sentence (e.g. for sequence tagging tasks such as named entity detection) can be encoded appropriately using the DictVectorizer feature extraction helper class of scikit-learn.
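For example, here is a minimal sketch of that approach (the window features, toy data, and the choice of LinearSVC are illustrative assumptions, not part of the original answer):

    # Sketch: encode string-valued window features as dicts, let
    # DictVectorizer one-hot encode them into a float matrix, then
    # train a linear SVM. Feature names and data are invented.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.svm import LinearSVC

    # One dict of categorical (string-valued) features per token.
    train_features = [
        {'word': 'young', 'prev': 'the', 'next': 'cat'},
        {'word': 'cat', 'prev': 'young', 'next': 'sat'},
        {'word': 'sat', 'prev': 'cat', 'next': 'on'},
    ]
    train_labels = ['ADJ', 'NOUN', 'VERB']  # string class labels are fine for y

    # Each (feature, value) pair becomes one column of a sparse float matrix.
    vec = DictVectorizer()
    X_train = vec.fit_transform(train_features)

    clf = LinearSVC()
    clf.fit(X_train, train_labels)

    # Reuse the fitted vectorizer at prediction time (transform, not fit_transform).
    X_test = vec.transform([{'word': 'dog', 'prev': 'the', 'next': 'ran'}])
    print(clf.predict(X_test))  # e.g. ['NOUN']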

answered Nov 14 '22 by ogrisel


This is not so much a scikit-learn or Python question, but more of a general issue with SVMs.

Data instances in SVMs must be represented as vectors of scalars, typically real numbers. Categorical attributes must therefore first be mapped to numeric values before they can be included in SVMs.

Some categorical attributes lend themselves more naturally or logically to being mapped onto some scale (a loose "metric"). For example, a (1, 2, 3, 5) mapping for a Priority field with values of 'no rush', 'standard delivery', 'Urgent' and 'Most Urgent' may make sense. Another example is colors, which can be mapped to three dimensions, one each for their red, green and blue components.

Other attributes don't have semantics that allow even an approximate logical mapping onto a scale; the various values for these attributes must then be assigned an arbitrary numeric value on one (or possibly several) dimension(s) of the SVM. Understandably, if an SVM has many of these arbitrary "non-metric" dimensions, it can be less efficient at properly classifying items, because the distance computations and clustering logic implicit in the working of SVMs are less semantically related. Both strategies are sketched in the snippet below.
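A hedged sketch of these two mapping strategies (the toy values come from the examples above; the answer doesn't prescribe a specific API, and OneHotEncoder is just one way to do the one-dimension-per-value mapping):

    # Ordinal mapping vs. one-hot encoding for categorical attributes.
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # Ordinal: 'priority' has a natural order, so a hand-picked scale
    # (1, 2, 3, 5) preserves it as a single numeric feature.
    priority_scale = {'no rush': 1, 'standard delivery': 2,
                      'Urgent': 3, 'Most Urgent': 5}
    priorities = ['no rush', 'Urgent', 'Most Urgent']
    ordinal = np.array([[priority_scale[p]] for p in priorities])

    # Non-metric: no meaningful order, so one-hot encoding gives each
    # value its own dimension and avoids implying spurious distances.
    colors = np.array([['red'], ['green'], ['blue'], ['red']])
    onehot = OneHotEncoder().fit_transform(colors).toarray()

    print(ordinal.ravel())  # [1 3 5]
    print(onehot)           # one column per distinct color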

This observation doesn't mean that SVMs cannot be used at all when the items include non-numeric or non-"metric" dimensions, but it is certainly a reminder that feature selection and feature mapping are very sensitive parameters of classifiers in general, and of SVMs in particular.

In the particular case of POS tagging... I'm afraid I'm stumped at the moment on which attributes of the labelled corpus to use and on how to map them to numeric values. I know that SVMTool can produce very efficient POS taggers using SVMs, and several scholarly papers also describe taggers based on SVMs. However, I'm more familiar with other approaches to tagging (e.g. HMMs or maximum entropy).

answered Nov 14 '22 by mjv