I want to use the part-of-speech (POS) tags returned by nltk.pos_tag as features for an sklearn classifier. How can I convert them to a vector and use it? For example:

    import nltk

    sent = "This is POS example"
    tok = nltk.tokenize.word_tokenize(sent)
    pos = nltk.pos_tag(tok)
    print(pos)

This returns the following:

    [('This', 'DT'), ('is', 'VBZ'), ('POS', 'NNP'), ('example', 'NN')]

Now I am unable to apply any of the scikit-learn vectorizers (DictVectorizer, FeatureHasher, or CountVectorizer) to this output so that I can use it in a classifier. Please suggest an approach.

python: How to use POS (part of speech) features in scikit learn classifiers (SVM) etc

2 Answers

I know this is a bit late, but gonna add an answer here.

Depending on what features you want, you'll need to encode the POS tags (written POST below) in a way that makes sense. I've had the best results with SVM classification using n-grams when I glue the original sentence to the sequence of its POS tags, so that it looks like the following:

word1 word2 word3 ... wordn POST1 POST2 POST3... POSTn

Once this is done, I feed it into a standard n-gram vectorizer (or whatever other feature extractor you prefer) and feed the result into the SVM.

This method keeps the information of the individual words, but also keeps the vital information of POS tag patterns, which matters when you give your system a word it hasn't seen before but that the tagger has encountered before.
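A minimal sketch of the gluing approach described above, assuming scikit-learn is installed. The helper name `glue_words_and_tags`, the hand-tagged toy sentences (standing in for `nltk.pos_tag` output), and the labels are all made up for illustration; they are not from the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def glue_words_and_tags(tagged):
    """Turn [(word, tag), ...] into the string 'w1 ... wn T1 ... Tn'."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    return " ".join(words + tags)


# Hand-tagged toy data (hypothetical), shaped like nltk.pos_tag output.
tagged_sentences = [
    [("This", "DT"), ("is", "VBZ"), ("POS", "NNP"), ("example", "NN")],
    [("Bats", "NNS"), ("eat", "VBP"), ("bugs", "NNS")],
    [("He", "PRP"), ("bats", "VBZ"), ("well", "RB")],
    [("Owls", "NNS"), ("catch", "VBP"), ("mice", "NNS")],
]
labels = ["other", "zoo", "baseball", "zoo"]

docs = [glue_words_and_tags(sentence) for sentence in tagged_sentences]

# lowercase=False keeps tags in upper case; token_pattern keeps every
# whitespace-separated token, including one-character words and tags.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=False, token_pattern=r"\S+"),
    LinearSVC(),
)
model.fit(docs, labels)

# The words here are unseen, but the tag sequence is not.
unseen = glue_words_and_tags([("Hawks", "NNS"), ("hunt", "VBP"), ("voles", "NNS")])
prediction = model.predict([unseen])
```

With `ngram_range=(1, 2)` the vectorizer also counts tag bigrams such as `NNS VBP`, which is where the pattern information comes from. The n-gram range is a choice to tune, not a recommendation.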

answered Sep 18 '22 00:09

ABC

If I'm understanding you right, this is a bit tricky. Once you tag it, your sentence (or document, or whatever) is no longer composed of words, but of pairs (word + tag), and it's not clear how to make the most useful vector-of-scalars out of that.

Most text vectorizers do something like counting how many times each vocabulary item occurs, and then making a feature for each one:

 the: 4, player: 1, bats: 1, well: 2, today: 3, ...

The next document might have:

 the: 0, quick: 5, flying: 3, bats: 1, caught: 1, bugs: 2

Both can be stored as arrays of integers so long as you always put the same key in the same array element (you'll have a lot of zeros for most documents) -- or as a dict. So a vectorizer does that for many "documents", and then works on that.
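To make the "same key in the same array element" idea concrete, here is a small sketch using only the standard library. It reuses the counts from the two example documents above; sorting the vocabulary alphabetically is just one arbitrary way to fix the column order.

```python
from collections import Counter

# Per-document term counts, taken from the two examples above.
doc_counts = [
    Counter({"the": 4, "player": 1, "bats": 1, "well": 2, "today": 3}),
    Counter({"quick": 5, "flying": 3, "bats": 1, "caught": 1, "bugs": 2}),
]

# One shared vocabulary, in a fixed order, so column i means the same
# term in every document.
vocabulary = sorted(set().union(*doc_counts))

# A Counter returns 0 for missing keys, which gives the zeros for free.
vectors = [[counts[term] for term in vocabulary] for counts in doc_counts]

for term, column in zip(vocabulary, zip(*vectors)):
    print(f"{term:>8}: {column}")
```

This is roughly the bookkeeping a count-based vectorizer does for you, except that real implementations store the result as a sparse matrix because most entries are zero.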

So your question boils down to how to turn a list of pairs into a flat list of items that the vectorizers can count.

The most trivial way is to flatten your data to

('This', 'POS_DT', 'is', 'POS_VBZ', 'POS', 'POS_NNP', 'example', 'POS_NN')

The usual counting would then get a vector of 8 vocabulary items, each occurring once. I renamed the tags to make sure they can't get confused with words.
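The flattening step can be sketched in a few lines of plain Python. The function name `flatten_with_prefixed_tags` and the `prefix` parameter are hypothetical names chosen for this example.

```python
from collections import Counter


def flatten_with_prefixed_tags(tagged, prefix="POS_"):
    """Interleave words and prefixed tags: [w1, POS_t1, w2, POS_t2, ...]."""
    flat = []
    for word, tag in tagged:
        flat.extend([word, prefix + tag])
    return flat


# Output of nltk.pos_tag from the question.
pos = [("This", "DT"), ("is", "VBZ"), ("POS", "NNP"), ("example", "NN")]

flat = flatten_with_prefixed_tags(pos)
counts = Counter(flat)
```

Note that the prefix only prevents clashes if the vectorizer is case-sensitive: the word `POS` and a lowercased tag prefix would otherwise start to look alike, so lowercasing should be switched off or the prefix chosen accordingly.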

That would get you up and running, but it probably wouldn't accomplish much. That's because just knowing how many occurrences of each part of speech there are in a sample may not tell you what you need -- notice that any notion of which parts of speech go with which words is gone after the vectorizer does its counting.

Running a classifier on that may have some value if you're trying to distinguish something like style -- fiction may have more adjectives, lab reports may have fewer proper names (maybe), and so on.

Instead, you could change your data to

('This_DT', 'is_VBZ', 'POS_NNP', 'example_NN')

That keeps each tag "tied" to the word it belongs with, so now the vectors will be able to distinguish samples where "bat" is used as a verb from samples where it's only used as a noun. That would tell you slightly different things -- for example, "bat" as a verb is more likely in texts about baseball than in texts about zoos.
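One possible way to feed such word_tag tokens to scikit-learn is to pass a callable as the `analyzer` of `CountVectorizer`, so that each "document" is an already-tagged list of pairs. This is a sketch under that assumption; `word_tag_tokens` and the toy sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer


def word_tag_tokens(tagged):
    """Turn [(word, tag), ...] into ['word_tag', ...]."""
    return [f"{word}_{tag}" for word, tag in tagged]


# Hand-tagged toy documents (hypothetical), shaped like nltk.pos_tag output.
tagged_docs = [
    [("This", "DT"), ("is", "VBZ"), ("POS", "NNP"), ("example", "NN")],
    [("He", "PRP"), ("bats", "VBZ"), ("well", "RB")],
    [("The", "DT"), ("bats", "NNS"), ("sleep", "VBP")],
]

# With a callable analyzer, the vectorizer skips its own tokenizing and
# lowercasing and simply counts whatever tokens the callable returns.
vectorizer = CountVectorizer(analyzer=word_tag_tokens)
X = vectorizer.fit_transform(tagged_docs)
vocab = vectorizer.vocabulary_
```

Here `bats_VBZ` and `bats_NNS` end up in separate columns, so a classifier trained on `X` can tell the verb from the noun. The price is a larger and sparser vocabulary, since every word is split by tag.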

And there are many other arrangements you could do.

To get good results from using vector methods on natural language text, you will likely need to put a lot of thought (and testing) into just what features you want the vectorizer to generate and use. It depends heavily on what you're trying to accomplish in the end.

Hope that helps.

answered Sep 17 '22 00:09

TextGeek

Tags:

python

machine-learning

nltk

scikit-learn

Suresh Mali
