Python TfidfVectorizer throwing: "empty vocabulary; perhaps the documents only contain stop words"

Question

I'm trying to use scikit-learn's TfidfVectorizer to transform a corpus of text. However, when I call fit_transform on it, I get ValueError: empty vocabulary; perhaps the documents only contain stop words.

In [69]: TfidfVectorizer().fit_transform(smallcorp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-69-ac16344f3129> in <module>()
----> 1 TfidfVectorizer().fit_transform(smallcorp)

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
   1217         vectors : array, [n_samples, n_features]
   1218         """
-> 1219         X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   1220         self._tfidf.fit(X)
   1221         # X is already a transformed view of raw_documents so

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    778         max_features = self.max_features
    779 
--> 780         vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
    781         X = X.tocsc()
    782 

/Users/maxsong/anaconda/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in _count_vocab(self, raw_documents, fixed_vocab)
    725             vocabulary = dict(vocabulary)
    726             if not vocabulary:
--> 727                 raise ValueError("empty vocabulary; perhaps the documents only"
    728                                  " contain stop words")
    729 

ValueError: empty vocabulary; perhaps the documents only contain stop words

I read through the SO question "Problems using a custom vocabulary for TfidfVectorizer scikit-learn" and tried ogrisel's suggestion of using TfidfVectorizer(**params).build_analyzer()(dataset2) to check the results of the text analysis step. That seems to be working as expected; snippet below:

In [68]: TfidfVectorizer().build_analyzer()(smallcorp)
Out[68]: 
[u'due',
 u'to',
 u'lack',
 u'of',
 u'personal',
 u'biggest',
 u'education',
 u'and',
 u'husband',
 u'to',

Is there something else that I am doing wrong? The corpus I am feeding it is just one giant long string punctuated by newlines.

Thanks!

herrfz · Accepted Answer

I guess it's because you just have one string. Try splitting it into a list of strings, e.g.:

In [51]: smallcorp
Out[51]: 'Ah! Now I have done Philosophy,\nI have finished Law and Medicine,\nAnd sadly even Theology:\nTaken fierce pains, from end to end.\nNow here I am, a fool for sure!\nNo wiser than I was before:'

In [52]: tf = TfidfVectorizer()

In [53]: tf.fit_transform(smallcorp.split('\n'))
Out[53]: 
<6x28 sparse matrix of type '<type 'numpy.float64'>'
    with 31 stored elements in Compressed Sparse Row format>
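
To make the difference concrete, here is a minimal sketch using the same six lines. It assumes a reasonably recent scikit-learn in which min_df defaults to 1. The exact error for the single-string case depends on the version: as far as I can tell, older releases iterate over the string character by character, so nothing survives the default two-character token pattern, while newer ones reject a bare string outright. The sketch therefore only relies on a ValueError being raised.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

smallcorp = (
    "Ah! Now I have done Philosophy,\n"
    "I have finished Law and Medicine,\n"
    "And sadly even Theology:\n"
    "Taken fierce pains, from end to end.\n"
    "Now here I am, a fool for sure!\n"
    "No wiser than I was before:"
)

# Passing the bare string: fit_transform expects an iterable of documents,
# so this is expected to fail with a ValueError (message varies by version).
try:
    TfidfVectorizer().fit_transform(smallcorp)
    string_failed = False
except ValueError as exc:
    string_failed = True
    print("single string:", exc)

# Passing a list with one document per line works.
docs = smallcorp.split("\n")
X = TfidfVectorizer().fit_transform(docs)
print(len(docs), "documents ->", X.shape)
```

The key point is that the vectorizer treats each element of the iterable as one document, so the corpus has to be split before it is passed in.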

Andreas Mueller · Answer

In version 0.12, we set the minimum document frequency to 2, which means that only words that appear in at least two documents will be considered. For your example to work, you need to set min_df=1. Since 0.13, this is the default setting. So I guess you are using 0.12, right?
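
A small sketch of the min_df effect, assuming a current scikit-learn and a made-up two-document corpus in which no word is shared between documents. With min_df=2 every term is pruned and fitting fails with a ValueError (the wording of the message depends on the version); with min_df=1 all four terms are kept.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: no term occurs in more than one document.
docs = ["apple banana", "cherry date"]

# min_df=2 keeps only terms found in at least two documents; none qualify here.
try:
    TfidfVectorizer(min_df=2).fit(docs)
    pruned_everything = False
except ValueError as exc:
    pruned_everything = True
    print("min_df=2:", exc)

# min_df=1 keeps every term that appears in at least one document.
vocab = sorted(TfidfVectorizer(min_df=1).fit(docs).vocabulary_)
print("min_df=1:", vocab)
```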
