After thoroughly profiling my program, I have pinpointed the vectorizer as the bottleneck. I am working on text data, and two lines of simple TF-IDF unigram vectorization are taking up 99.2% of the total execution time.
Here is a runnable example (this will download a 3 MB training file to your disk; omit the urllib parts to run on your own sample):
#####################################
# Loading Data
#####################################
import urllib
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk.stem
raw = urllib.urlopen("https://s3.amazonaws.com/hr-testcases/597/assets/trainingdata.txt").read()
with open("to_delete.txt", "w") as f:
    f.write(raw)
###
def extract_training():
    f = open("to_delete.txt")
    N = int(f.readline())
    X = []
    y = []
    for i in xrange(N):
        line = f.readline()
        label, text = int(line[0]), line[2:]
        X.append(text)
        y.append(label)
    return X, y
X_train, y_train = extract_training()
#############################################
# Extending Tfidf to have only stemmed features
#############################################
english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(TfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))
tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
#############################################
# Line below takes 6-7 seconds on my machine
#############################################
Xv = tfidf.fit_transform(X_train)
I tried converting the list X_train into an np.array, but there was no difference in performance.
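For reference, one way to confirm that the vectorizer dominates the runtime is with cProfile (a minimal sketch, not exactly how I profiled it):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
Xv = tfidf.fit_transform(X_train)   # the slow call from the example above
profiler.disable()

# Show the 10 most expensive calls by cumulative time; the NLTK stemmer
# calls are expected to dominate the listing.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)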
TF-IDF improves on plain count vectorization because it weights terms not only by how often they appear but also by how informative they are across the corpus. Terms that carry little information can then be down-weighted or removed, which reduces the input dimensionality and keeps the model simpler.
The main difference between the two implementations is that TfidfVectorizer performs both the term-frequency counting and the inverse-document-frequency weighting for you, whereas TfidfTransformer requires you to compute the term frequencies first with scikit-learn's CountVectorizer. With TfidfTransformer you first compute word counts using CountVectorizer, then compute the inverse document frequency (IDF) values, and only then compute the TF-IDF scores; with TfidfVectorizer you do all three steps at once.
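As a quick illustration (a minimal sketch with a made-up two-document corpus), the two pipelines produce the same matrix:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus

# Two-step pipeline: raw counts first, then IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step pipeline: TfidfVectorizer does counting and weighting together.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True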
TF-IDF is one of the most popular text vectorizers; the calculation is simple and easy to understand. It gives rare terms a high weight and common terms a low weight.
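For example (a small sketch on a made-up three-document corpus), a term that appears in every document gets a lower IDF weight than a term that appears in only one:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["machine learning is fun",
        "machine learning is hard",
        "deep learning is everywhere"]
vec = TfidfVectorizer().fit(docs)

# idf_ holds the learned inverse document frequencies; with the default
# smooth_idf=True, idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1.
for term, idx in sorted(vec.vocabulary_.items()):
    print(term, vec.idf_[idx])
# Terms like "learning" and "is" (present in all three documents) get the
# lowest weight; one-off terms like "deep" or "fun" get the highest.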
Unsurprisingly, it's NLTK that is slow:
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 4.89 s per loop
>>> tfidf = TfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 415 ms per loop
You can speed this up by using a smarter implementation of the Snowball stemmer, e.g., PyStemmer:
>>> import Stemmer
>>> english_stemmer = Stemmer.Stemmer('en')
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         analyzer = super(TfidfVectorizer, self).build_analyzer()
...         return lambda doc: english_stemmer.stemWords(analyzer(doc))
...
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 650 ms per loop
NLTK is a teaching toolkit. It's slow by design, because it's optimized for readability.
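If you have to keep the NLTK stemmer, another option is to memoize it, since the same tokens recur many times across a corpus. This is only a sketch (the class and cache names are made up, and I haven't benchmarked it against PyStemmer):

import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')
stem_cache = {}  # token -> stem, so each distinct token is stemmed only once

def cached_stem(word):
    if word not in stem_cache:
        stem_cache[word] = english_stemmer.stem(word)
    return stem_cache[word]

class CachedStemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(CachedStemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (cached_stem(w) for w in analyzer(doc))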