
Why is pos_tag() so painfully slow and can this be avoided?

Tags: python, nltk

I want to be able to get the POS tags of sentences one by one, like this:

def __remove_stop_words(self, tokenized_text, stop_words):

    sentences_pos = nltk.pos_tag(tokenized_text)  
    filtered_words = [word for (word, pos) in sentences_pos 
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

The problem is that pos_tag() takes about a second per sentence. I could use pos_tag_sents() to tag in batches and speed things up, but my life would be easier if I could do this sentence by sentence.

Is there a way to do this faster?

Stefan Falk asked Nov 20 '15 14:11


1 Answer

For nltk version 3.1, inside nltk/tag/__init__.py, pos_tag is defined like this:

from nltk.tag.perceptron import PerceptronTagger
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)    

So each call to pos_tag first instantiates PerceptronTagger, which takes some time because it involves loading a pickle file. _pos_tag simply calls tagger.tag when tagset is None. So you can save some time by loading the file once and calling tagger.tag yourself instead of calling pos_tag:

from nltk.tag.perceptron import PerceptronTagger
tagger = PerceptronTagger() 
def __remove_stop_words(self, tokenized_text, stop_words, tagger=tagger):
    sentences_pos = tagger.tag(tokenized_text)  
    filtered_words = [word for (word, pos) in sentences_pos 
                      if pos not in stop_words and word not in stop_words]

    return filtered_words

pos_tag_sents uses the same trick as above -- it instantiates PerceptronTagger once before calling _pos_tag many times. So you'll get a comparable gain in performance using the above code as you would by refactoring and calling pos_tag_sents.
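
If you would rather not keep a module-level tagger around, the same load-once idea can be sketched with a cached factory function. This is only an illustration of the pattern: StandInTagger is a hypothetical stand-in, not an NLTK class, and get_tagger and tag_sentence are names made up for this example. With NLTK installed, you would presumably return PerceptronTagger() from get_tagger instead.

```python
from functools import lru_cache


class StandInTagger:
    """Hypothetical stand-in for an expensive-to-build tagger (not NLTK)."""

    instances_created = 0  # counts how often the "expensive" setup ran

    def __init__(self):
        # In a real tagger, this is where the model file would be loaded.
        StandInTagger.instances_created += 1

    def tag(self, tokens):
        # Toy rule: tag every token as a noun; only the call pattern matters.
        return [(token, "NN") for token in tokens]


@lru_cache(maxsize=None)
def get_tagger():
    """Build the tagger on first use, then return the cached instance."""
    return StandInTagger()


def tag_sentence(tokens):
    # Every call reuses the same tagger instead of constructing a new one.
    return get_tagger().tag(tokens)


sentences = [["a", "short", "sentence"], ["another", "one"]]
tagged = [tag_sentence(tokens) for tokens in sentences]
```

No matter how many sentences are tagged one by one, the constructor runs only once, which is the same saving as passing a shared tagger into the method.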


Also, if stop_words is a long list, you may save a bit of time by making stop_words a set:

stop_words = set(stop_words)

since checking membership in a set (e.g. pos not in stop_words) is an O(1) (constant time) operation on average, while checking membership in a list is an O(n) operation (i.e. it requires time which grows proportionally to the length of the list).
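
As a small, self-contained sketch of the set-based filtering: the (word, tag) pairs below are written by hand for illustration rather than produced by pos_tag, and remove_stop_words is a standalone variant of the method from the question, so treat the names and data as assumptions.

```python
def remove_stop_words(tagged_tokens, stop_words):
    """Drop tokens whose word or POS tag appears in stop_words."""
    # One-time conversion; every later membership test is a hash lookup.
    stop_set = set(stop_words)
    return [word for word, pos in tagged_tokens
            if pos not in stop_set and word not in stop_set]


# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
stop_words = ["the", "on", "IN"]  # may mix words and tags, as in the question

filtered = remove_stop_words(tagged, stop_words)
print(filtered)  # ['cat', 'sat', 'mat']
```

If the same stop_words collection is reused across many sentences, it is cheaper to convert it to a set once up front than inside every call.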

unutbu answered Oct 14 '22 14:10