After thoroughly profiling my program, I have pinpointed the vectorizer as the bottleneck. I am working on text data, and two lines of simple TF-IDF unigram vectorization are taking up 99.2% of the total execution time.
Here is a runnable example (this will download a 3 MB training file to your disk; omit the urllib parts to run on your own sample):
#####################################
# Loading Data
#####################################
import urllib
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk.stem
raw = urllib.urlopen("https://s3.amazonaws.com/hr-testcases/597/assets/trainingdata.txt").read()
with open("to_delete.txt", "w") as f:
    f.write(raw)
###
def extract_training():
    f = open("to_delete.txt")
    N = int(f.readline())
    X = []
    y = []
    for i in xrange(N):
        line = f.readline()
        label, text = int(line[0]), line[2:]
        X.append(text)
        y.append(label)
    return X, y
X_train, y_train = extract_training()
#############################################
# Extending Tfidf to have only stemmed features
#############################################
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
#############################################
# Line below takes 6-7 seconds on my machine
#############################################
Xv = tfidf.fit_transform(X_train)
I tried converting the list X_train into an np.array, but there was no difference in performance.
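For reference, one way to confirm that the vectorizer dominates the runtime is with cProfile (a minimal sketch, not exactly how I profiled it):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
Xv = tfidf.fit_transform(X_train)   # the slow call from the example above
profiler.disable()

# Show the 10 most expensive calls by cumulative time; the NLTK stemmer
# calls are expected to dominate the listing.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)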
TF-IDF improves on plain count vectorization because it weights terms not only by how often they appear but also by how informative they are across the corpus. Terms that carry little information can then be down-weighted or removed, which reduces the input dimensionality and keeps the model simpler.
The main difference between the two implementations is that TfidfVectorizer performs both the term-frequency counting and the inverse-document-frequency weighting for you, whereas TfidfTransformer requires you to compute the term frequencies first with scikit-learn's CountVectorizer. With TfidfTransformer you first compute word counts using CountVectorizer, then compute the inverse document frequency (IDF) values, and only then compute the TF-IDF scores; with TfidfVectorizer you do all three steps at once.
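As a quick illustration (a minimal sketch with a made-up two-document corpus), the two pipelines produce the same matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus

# Two-step pipeline: raw counts first, then IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step pipeline: TfidfVectorizer does counting and weighting together.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True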
TF-IDF is one of the most popular text vectorizers; the calculation is simple and easy to understand. It gives rare terms a high weight and common terms a low weight.
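For example (a small sketch on a made-up three-document corpus), a term that appears in every document gets a lower IDF weight than a term that appears in only one:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning is fun",
        "machine learning is hard",
        "deep learning is everywhere"]
vec = TfidfVectorizer().fit(docs)

# idf_ holds the learned inverse document frequencies; with the default
# smooth_idf=True, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1.
for term, idx in sorted(vec.vocabulary_.items()):
    print(term, vec.idf_[idx])
# Terms like "learning" and "is" (present in all three documents) get the
# lowest weight; one-off terms like "deep" or "fun" get the highest.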
Unsurprisingly, it's NLTK that is slow:
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 4.89 s per loop
>>> tfidf = TfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 415 ms per loop
You can speed this up by using a smarter implementation of the Snowball stemmer, e.g., PyStemmer:
>>> import Stemmer
>>> english_stemmer = Stemmer.Stemmer('en')
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         analyzer = super(TfidfVectorizer, self).build_analyzer()
...         return lambda doc: english_stemmer.stemWords(analyzer(doc))
...
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 650 ms per loop
NLTK is a teaching toolkit. It's slow by design, because it's optimized for readability.
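If you have to keep the NLTK stemmer, another option is to memoize it, since the same tokens recur many times across a corpus. This is only a sketch (the class and cache names are made up, and I haven't benchmarked it against PyStemmer):

import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')
stem_cache = {}  # token -> stem, so each distinct token is stemmed only once

def cached_stem(word):
    if word not in stem_cache:
        stem_cache[word] = english_stemmer.stem(word)
    return stem_cache[word]

class CachedStemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(CachedStemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (cached_stem(w) for w in analyzer(doc))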