
How can I prevent TfidfVectorizer from picking up numbers as vocabulary?

I use TfidfVectorizer like this:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(docs['train'])
xs['test'] = vectorizer.transform(docs['test']).toarray()

But when inspecting vectorizer.vocabulary_ I've noticed that it learns pure number features:

[(u'00', 0), (u'000', 1), (u'0000', 2), (u'00000', 3), (u'000000', 4), ...]

I don't want this. How can I prevent it?

asked Aug 07 '17 by Martin Thoma

People also ask

Does TfidfVectorizer remove stop words?

Given how the TF-IDF score is computed, removing the stopwords shouldn't make a significant difference. The whole point of the IDF is precisely to down-weight words that carry no semantic value across the corpus. Even if you leave the stopwords in, the IDF weighting should largely suppress them.
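
A minimal sketch of that effect, assuming a toy three-document corpus: the word that appears in every document gets the lowest possible IDF weight.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the bird flew"]
vec = TfidfVectorizer()  # no stop word removal
vec.fit(docs)

# "the" occurs in all three documents, so its smoothed IDF is
# ln((1+3)/(1+3)) + 1 = 1.0, the minimum; every other word gets
# ln((1+3)/(1+1)) + 1 ~= 1.69.
for word, idx in sorted(vec.vocabulary_.items()):
    print(word, round(vec.idf_[idx], 3))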

What is the difference between CountVectorizer and TfidfVectorizer?

With TfidfTransformer you first compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores. With TfidfVectorizer, by contrast, you do all three steps at once.

What is the difference between TfidfVectorizer and TfidfTransformer?

The main difference between the two implementations is that TfidfVectorizer computes both the term frequencies and the inverse document frequencies for you, while TfidfTransformer requires you to first compute the term frequencies with scikit-learn's CountVectorizer class.
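
A minimal sketch, assuming a toy two-document corpus, showing that the two routes produce the same matrix:

from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Two steps: raw term counts first, then IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: tokenization, counting and IDF weighting at once.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True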

Does TfidfVectorizer do Stemming?

No, not by itself. Instead, you pass TfidfVectorizer your own function that performs the custom tokenization and stemming, while using scikit-learn's built-in stop word removal rather than NLTK's.
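
A minimal sketch of that setup, assuming NLTK is installed (the stemming_tokenizer name is just illustrative):

import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    # Reproduce the default token shape, then stem each token.
    tokens = re.findall(r'(?u)\b\w\w+\b', text.lower())
    return [stemmer.stem(t) for t in tokens]

# TfidfVectorizer itself never stems; the custom tokenizer does.
# Recent scikit-learn versions may warn that the stop list can be
# inconsistent with stemmed tokens, but it still runs.
vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer,
                             stop_words='english')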


1 Answer

You can set token_pattern when initializing the vectorizer. The default is r'(?u)\b\w\w+\b' (the (?u) part just turns on the re.UNICODE flag; note the raw-string prefix, since in a plain string '\b' is a backspace character, not a word boundary). You can fiddle with that until you get what you need.

Something like:

vectorizer = TfidfVectorizer(stop_words=stop_words,
                             min_df=200,
                             token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')
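
The \w*[a-zA-Z]\w* core requires at least one alphabetic character in every token, so purely numeric tokens such as 000 no longer match.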

Another option (if the fact that numbers appear in your samples matters) is to mask all the numbers before vectorizing.

import re

re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER-SPECIAL-TOKEN', sample)

This way numbers will hit the same spot in your vectorizer's vocabulary and you won't completely ignore them either.
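
A minimal sketch of that masking step (the sample string is made up); one tweak worth noting: the default token pattern treats '-' as a separator, so joining the mask with underscores keeps it a single vocabulary entry.

import re

def mask_numbers(text):
    # Collapse every number-like token into one shared mask token.
    return re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER_SPECIAL_TOKEN', text)

print(mask_numbers("error 404 occurred 12,000 times"))
# error NUMBER_SPECIAL_TOKEN occurred NUMBER_SPECIAL_TOKEN times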

answered Oct 03 '22 by Iulius Curt