Some doubts modelling some features for the libsvm/scikit-learn library in python

Tags:

python

dictionary

scikit-learn

libsvm

I have scraped a lot of eBay titles like this one:

Apple iPhone 5 White 16GB Dual-Core

and I have manually tagged all of them in this way:

B M C S NA

where B=Brand (Apple), M=Model (iPhone 5), C=Color (White), S=Size (16GB), NA=Not Assigned (Dual-Core).

Now I need to train an SVM classifier using the libsvm library in Python to learn the sequence patterns that occur in the eBay titles.

I need to extract new values for those attributes (Brand, Model, Color, Size) by treating the problem as a classification one. In this way I can predict new models.

I want to represent the following features so that I can use them as input to the libsvm library. I work in Python :D.

1. Identity of the current word

I think that I can interpret it in this way

0 --> Brand
1 --> Model
2 --> Color
3 --> Size 
4 --> NA

If I know that the word is a Brand I will set that variable to 1 (true). It is OK to do this in the training set (because I have tagged all the words), but how can I do it for the test set? I don't know the category of a word (this is why I'm learning it :D).

2. N-gram substring features of current word (N=4,5,6)

I have no idea what this means.

3. Identity of 2 words before the current word.

How can I model this feature?

Considering the legend that I created for the 1st feature, I have 5^2 = 25 combinations:

00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44

How can I convert it to a format that the libsvm (or scikit-learn) can understand?

4. Membership to the 4 dictionaries of attributes

Again, how can I do it? Having 4 dictionaries (for color, size, model and brand), I think that I must create a bool variable that I will set to true if and only if the current word matches an entry in one of the 4 dictionaries.

5. Exclusive membership to dictionary of brand names

I think that, as with feature 4, I must use a bool variable. Do you agree?

If this question lacks some info please read my previous question at this address: Support vector machine in Python using libsvm example of features

Last doubt: if I have a multi-token value like iPhone 5, should I tag iPhone as a brand and 5 also as a brand, or is it better to tag {iPhone 5} as a whole as a brand?

In the test dataset iPhone and 5 will be two separate words, so what is better to do?


asked Jun 28 '15 19:06

Usi Usi

1 Answer

I assume the solution proposed to you in the previous question gave insufficient results because the features were poor for this problem.

If I understand correctly, what you want is the following:

Given the sentence:

Apple iPhone 5 White 16GB Dual-Core

you want to get:

B M C S NA

The problem you are describing here is equivalent to part-of-speech (POS) tagging in Natural Language Processing.

Consider the following sentence in English:

We saw the yellow dog

The task of POS tagging is to give the appropriate tag to each word. In this case:

We(PRP) saw(VBD) the(DT) yellow(JJ) dog(NN)

Don't invest time in understanding the English tags here; I give them only to show you that your problem and POS tagging are equivalent.

Before I explain how to solve it using an SVM, I want to make you aware of other approaches. Consider the sentence Apple iPhone 5 White 16GB Dual-Core as test data. The tag you assign to the word Apple must be given as input to the tagger when you tag the word iPhone. However, once you have tagged a word, you will not change it, so an early mistake carries over to the following words. Hence, models that tag the whole sequence jointly usually achieve better results. The simplest example is the Hidden Markov Model (HMM). Here is a short intro to HMM in POS.

Now we model this problem as a classification problem. Let's define what a window is:

W-2,W-1,W0,W1,W2

Here, we have a window of size 2. When classifying the word W0, we need the features of all the words in the window (concatenated). Note that for the first word of the sentence we use:

START-2,START-1,W0,W1,W2

in order to model the fact that this is the first word. For the second word we have:

START-1,W-1,W0,W1,W2

And similarly for the words at the end of the sentence. The tags START-2,START-1,STOP1,STOP2 must be added to the model too.
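
The padded windows described above can be sketched as follows. This is only a minimal illustration, assuming the title is split on whitespace; the helper name `windows` and its `size` parameter are made up for this example and do not come from any library.

```python
def windows(tokens, size=2):
    """Yield one padded window of 2*size+1 tokens for each token in `tokens`."""
    # Pad with START-k on the left and STOPk on the right, as in the text.
    left = ["START-%d" % k for k in range(size, 0, -1)]
    right = ["STOP%d" % k for k in range(1, size + 1)]
    padded = left + list(tokens) + right
    for i in range(len(tokens)):
        # padded[i + size] is W0; the slice covers W-size .. W+size.
        yield padded[i:i + 2 * size + 1]


title = "Apple iPhone 5 White 16GB Dual-Core".split()
for window in windows(title, size=2):
    print(window)
```

The first window printed is START-2, START-1, Apple, iPhone, 5 and the last one ends with STOP1, STOP2, matching the layout above.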

Now, let's describe the features used for tagging W0:

Features(W-2),Features(W-1),Features(W0),Features(W1),Features(W2)

The features of a token should be the word itself and, for the tokens before W0, the tag already assigned to them. We shall use binary features.

Example - how to build the feature representation:

Step 1 - building the word representation for each token:

Let's take a window size of 1. When classifying a token, we use W-1,W0,W1. Say you build a dictionary and give every word in the corpus a number:

n['Apple'] = 0
n['iPhone 5'] = 1
n['White'] = 2
n['16GB'] = 3
n['Dual-Core'] = 4
n['START-1'] = 5
n['STOP1'] = 6

Step 2 - feature token for each tag:

We create features for the following tags. They are kept in a separate mapping t, because START-1 and STOP1 already appear in n as words:

t['B'] = 7
t['M'] = 8
t['C'] = 9
t['S'] = 10
t['NA'] = 11
t['START-1'] = 12
t['STOP1'] = 13

Let's build a feature vector for START-1,Apple,iPhone 5. The first token is a word with a known tag (START-1 will always have the tag START-1). So the features for this token are:

(0,0,0,0,0,1,0,0,0,0,0,0,1,0)

(The features that are 1: having the word START-1 at index 5, and the tag START-1 at index 12)

For the token Apple:

(1,0,0,0,0,0,0)

Note that we use the already-calculated tag as a feature only for words before W0 (since their tags have already been predicted); W0 and the words after it contribute only their word features. Similarly, the features of the token iPhone 5:

(0,1,0,0,0,0,0)

Step 3 - concatenate all the features:

Generally, the features for 1-window will be:

word(W-1),tag(W-1),word(W0),word(W1)
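
As a rough sketch of this concatenation, the snippet below builds the binary vector for the window START-1, Apple, iPhone 5. The names `WORDS`, `TAGS`, `one_hot` and `window_vector` are hypothetical helpers for this example only, and each block uses its own local indices (0 to 6) rather than one global numbering.

```python
# Word and tag vocabularies from Steps 1 and 2 (indices are local to each block).
WORDS = ["Apple", "iPhone 5", "White", "16GB", "Dual-Core", "START-1", "STOP1"]
TAGS = ["B", "M", "C", "S", "NA", "START-1", "STOP1"]


def one_hot(item, vocab):
    """Return a binary list with a single 1 at the position of `item`.

    Raises ValueError for items missing from `vocab`; a real system would
    need an explicit policy for unseen words.
    """
    vec = [0] * len(vocab)
    vec[vocab.index(item)] = 1
    return vec


def window_vector(prev_word, prev_tag, word, next_word):
    """Concatenate word(W-1), tag(W-1), word(W0), word(W1)."""
    return (one_hot(prev_word, WORDS) + one_hot(prev_tag, TAGS)
            + one_hot(word, WORDS) + one_hot(next_word, WORDS))


x = window_vector("START-1", "START-1", "Apple", "iPhone 5")
print(len(x))                                   # 7 + 7 + 7 + 7 entries
print([i for i, value in enumerate(x) if value])  # positions of the 1s
```

Only four entries are 1, so in practice a sparse representation (a list of active indices) is usually more convenient than the full vector.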

Regarding your question: I would use one more tag, number. Since you split the title on spaces, 5 becomes its own token; when you tag it, W0 will have a 1 on some number feature, and W-1 will have a 1 on its model tag feature, provided the previous token was correctly tagged as model.

To sum up, what you should do:

1. give a number to each word in the data
2. build feature representation for the train data (using the tags you already calculated manually)
3. train a model
4. label the test data
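
Steps 1 and 2 can be sketched as below for a window of size 1. This is an illustration under assumptions, not the only way to do it: `build_index` and `libsvm_lines` are hypothetical helpers, and the output follows the sparse text format that libsvm reads, `<label> <index>:<value> ...`, with feature indices in ascending order and starting at 1.

```python
def build_index(items):
    """Assign consecutive integers to items in first-seen order."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))
    return index


def libsvm_lines(words, tags, word_ids, tag_ids):
    """Yield one '<label> <index>:1 ...' line per token of a tagged title."""
    nw, nt = len(word_ids), len(tag_ids)
    padded_w = ["START-1"] + words + ["STOP1"]
    padded_t = ["START-1"] + tags
    for i, tag in enumerate(tags):
        active = [
            word_ids[padded_w[i]],                    # word(W-1)
            nw + tag_ids[padded_t[i]],                # tag(W-1)
            nw + nt + word_ids[padded_w[i + 1]],      # word(W0)
            2 * nw + nt + word_ids[padded_w[i + 2]],  # word(W1)
        ]
        # Shift to 1-based indices and keep them in ascending order.
        pairs = " ".join("%d:1" % (k + 1) for k in sorted(active))
        yield "%d %s" % (tag_ids[tag], pairs)


words = ["Apple", "iPhone", "5", "White", "16GB", "Dual-Core"]
tags = ["B", "M", "NUMBER", "C", "S", "NA"]
word_ids = build_index(words + ["START-1", "STOP1"])
tag_ids = build_index(tags + ["START-1"])
for line in libsvm_lines(words, tags, word_ids, tag_ids):
    print(line)
```

For the training data the tag of W-1 comes from your manual annotation; when labelling the test data it has to come from the model's own prediction for the previous token.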

Final Note - a Warm Tip For Existing Code:

You can find a POS tagger implemented in Python here. It includes an explanation of the problem and code, and it also does the feature extraction I just described. Additionally, it uses a set to represent the features of each word, so the code is much simpler to read.

The data this tagger receives should look like this:

Apple_B iPhone_M 5_NUMBER White_C 16GB_S Dual-Core_NA
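
A small sketch of reading one line in this word_TAG layout into parallel lists is shown below. The function name `parse_tagged_title` is hypothetical, and it assumes tags themselves never contain an underscore.

```python
def parse_tagged_title(line):
    """Split 'word_TAG word_TAG ...' into parallel lists of words and tags."""
    words, tags = [], []
    for token in line.split():
        # rsplit on the last underscore keeps underscores inside the word itself.
        word, tag = token.rsplit("_", 1)
        words.append(word)
        tags.append(tag)
    return words, tags


words, tags = parse_tagged_title(
    "Apple_B iPhone_M 5_NUMBER White_C 16GB_S Dual-Core_NA")
print(words)
print(tags)
```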

Feature extraction is done in this manner (see more at the link above):

def get_features(i, word, context, prev):
    '''Map tokens-in-contexts into a feature representation, implemented as a
    set. If the features change, a new model must be trained.'''
    def add(name, *args):
        # Each feature is a string such as 'i word+iPhone'.
        features.add('+'.join((name,) + tuple(args)))

    features = set()
    add('bias') # This acts sort of like a prior
    add('i suffix', word[-3:])
    add('i-1 tag', prev)
    add('i word', context[i])
    # With an unpadded context, use boundary markers at the edges so that
    # context[i-1] does not wrap around and context[i+1] does not raise.
    add('i-1 word', context[i-1] if i > 0 else 'START-1')
    add('i+1 word', context[i+1] if i + 1 < len(context) else 'STOP1')
    return features

For the example above:

context = ["Apple","iPhone","5","White","16GB","Dual-Core"]
prev = "B"
i = 1
word = "iPhone"

Generally, word is the str of the current word, context is the title split into a list, and prev is the tag you received for the previous word.
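
To show how prev is fed forward at test time, here is a minimal greedy tagging loop. Everything in it is an assumption made for illustration: `token_features` is a compact restatement of the feature function above that uses START-1/STOP1 markers at the title boundaries, and `toy_predict` is only a stand-in for a trained classifier.

```python
def token_features(i, context, prev):
    """Return the feature set for context[i], given the previous tag."""
    word = context[i]
    before = context[i - 1] if i > 0 else "START-1"
    after = context[i + 1] if i + 1 < len(context) else "STOP1"
    return {
        "bias",
        "i suffix+" + word[-3:],
        "i-1 tag+" + prev,
        "i word+" + word,
        "i-1 word+" + before,
        "i+1 word+" + after,
    }


def greedy_tag(context, predict):
    """Tag left to right, feeding each predicted tag into the next token."""
    tags = []
    prev = "START-1"
    for i in range(len(context)):
        prev = predict(token_features(i, context, prev))
        tags.append(prev)
    return tags


def toy_predict(features):
    """Stand-in for a trained model; a real classifier would replace this."""
    if "i-1 tag+START-1" in features:
        return "B"
    if "i-1 tag+B" in features:
        return "M"
    return "NA"


title = ["Apple", "iPhone", "5", "White", "16GB", "Dual-Core"]
print(token_features(1, title, "B"))
print(greedy_tag(title, toy_predict))
```

If you train with scikit-learn, one option is to turn each feature set into a dict such as {name: 1} and pass the dicts through DictVectorizer before fitting the classifier.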

I have used this code in the past; it runs fast and gave great results. Hope it's clear. Have fun tagging!


answered Sep 29 '22 10:09

omerbp
