I use TfidfVectorizer like this:
from nltk.corpus import stopwords  # needed for stopwords.words below
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(docs['train'])
xs['test'] = vectorizer.transform(docs['test']).toarray()
But when inspecting vectorizer.vocabulary_, I've noticed that it learns pure-number features:
[(u'00', 0), (u'000', 1), (u'0000', 2), (u'00000', 3), (u'000000', 4)
I don't want this. How can I prevent it?
Given how the Tf-idf score is defined, removing the stop words shouldn't make a significant difference. The whole point of the Idf term is to down-weight words that carry no semantic value across the corpus, so if you do leave the stop words in, the Idf should take care of them.
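As a minimal sketch of that claim (the toy corpus below is made up, and get_feature_names_out assumes a recent scikit-learn), you can fit a TfidfVectorizer without any stop word list and inspect its idf_ weights: a word that occurs in every document gets the lowest possible idf, so its tf-idf contribution is heavily dampened:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the house",
]

vec = TfidfVectorizer()  # no stop_words argument
vec.fit(toy_docs)

idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
print(idf["the"])   # in every document -> lowest possible idf (1.0 with smoothing)
print(idf["bird"])  # in one document   -> noticeably higher idf (about 1.69)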
With TfidfTransformer you compute the word counts first with CountVectorizer, then the inverse document frequency (IDF) values, and only then the Tf-idf scores. With TfidfVectorizer, on the contrary, you do all three steps at once. That is the main difference between the two implementations: TfidfVectorizer performs both the term-frequency counting and the inverse-document-frequency weighting for you, while TfidfTransformer requires you to produce the term-frequency counts yourself with scikit-learn's CountVectorizer class.
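A minimal sketch of that difference on a made-up two-document corpus (both routes use scikit-learn defaults and should produce the same matrix):

import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer,
)

toy_docs = ["numbers like 000 show up a lot", "words matter more than 000"]

# Route 1: raw counts first, then IDF weighting on top of them.
counts = CountVectorizer().fit_transform(toy_docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Route 2: one estimator doing tokenization, counting and tf-idf weighting.
tfidf_one_step = TfidfVectorizer().fit_transform(toy_docs)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True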
In particular, we can pass TfidfVectorizer our own function that performs custom tokenization and stemming, while using scikit-learn's built-in stop word removal rather than NLTK's, as sketched below.
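A hedged sketch of that setup, assuming NLTK's PorterStemmer as the stemmer (the tokenize_and_stem helper is illustrative, not the exact function the text refers to); note that scikit-learn applies the stop word list to the tokenizer's output, so stemmed forms that no longer match the list may slip through, and recent versions will warn about that:

import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def tokenize_and_stem(text):
    # keep alphabetic tokens only, then stem each one
    return [stemmer.stem(tok) for tok in re.findall(r"[a-zA-Z]+", text.lower())]

vectorizer = TfidfVectorizer(
    tokenizer=tokenize_and_stem,   # our own tokenization + stemming
    stop_words="english",          # scikit-learn's built-in stop word list
    token_pattern=None,            # the pattern is unused when a tokenizer is given
)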
You could define the token_pattern when initializing the vectorizer. The default one is r'(?u)\b\w\w+\b' (the (?u) part just turns on the re.UNICODE flag). You could fiddle with that until you get what you need.
Something like:
vectorizer = TfidfVectorizer(stop_words=stop_words,
                             min_df=200,
                             token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')
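As a quick sanity check on a made-up two-document corpus (demo_docs and demo_vec are illustrative names, and get_feature_names_out assumes a recent scikit-learn), tokens made only of digits no longer reach the vocabulary, while mixed tokens like '3d' survive:

demo_docs = ["000 widgets and 3d models", "000000 more widgets"]
demo_vec = TfidfVectorizer(token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')  # min_df and stop_words omitted to keep the demo small
demo_vec.fit(demo_docs)
print(demo_vec.get_feature_names_out())
# e.g. ['3d' 'and' 'models' 'more' 'widgets'] -- no '000' or '000000'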
Another option (if the fact that numbers appear in your samples matters to you) is to mask all the numbers before vectorizing:

re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER-SPECIAL-TOKEN', sample)
This way numbers will hit the same spot in your vectorizer's vocabulary and you won't completely ignore them either.
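Put together with the question's setup, a minimal sketch of that route could look like this (it reuses docs, xs and stop_words from the snippet above, and mask_numbers is an illustrative helper, not part of the original answer):

import re
from sklearn.feature_extraction.text import TfidfVectorizer

number_re = re.compile(r'\b[0-9][0-9.,-]*\b')

def mask_numbers(text):
    # Replace every numeric token with a shared placeholder. The default
    # token_pattern splits the hyphenated placeholder into 'number', 'special'
    # and 'token', which still maps all numbers onto the same few features.
    return number_re.sub('NUMBER-SPECIAL-TOKEN', text)

masked = {split: [mask_numbers(doc) for doc in docs[split]] for split in ('train', 'test')}

vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(masked['train'])
xs['test'] = vectorizer.transform(masked['test'])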