Remove single occurrences of words in vocabulary TF-IDF

I am trying to remove words that occur only once in my vocabulary, to reduce the vocabulary size. I am using scikit-learn's TfidfVectorizer and calling fit_transform on a column of my data frame.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))

My first thought is the preprocessor field in the tfidf vectorizer or using the preprocessing package before machine learning.

Any tips or links to further implementation?

rglenn asked Aug 22 '17 05:08


People also ask

Does tf-idf vectorizer remove specific words from a document?

ShmulikA's answer will most likely work well, but min_df removes words based on document frequency, not on total count. If a word occurs 200 times but in only 1 document, it will still be removed. TfidfVectorizer does not provide exactly what you want.
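
The difference between document frequency and total count can be sketched in plain Python. This is only an illustration: the toy corpus is made up, and the whitespace tokenization is an assumption that does not match TfidfVectorizer's default tokenizer.

```python
from collections import Counter

# Hypothetical toy corpus: "spam" occurs three times, but in only one document
docs = [
    "spam spam spam eggs",
    "eggs and ham",
    "ham and toast",
]

# Simplistic whitespace tokenization (an assumption, not sklearn's tokenizer)
tokenized = [doc.lower().split() for doc in docs]

# Total occurrences of each word across the whole corpus
total_counts = Counter(word for tokens in tokenized for word in tokens)

# Number of documents each word appears in (document frequency)
doc_counts = Counter(word for tokens in tokenized for word in set(tokens))

# Keep words that occur more than once in total, whatever their document frequency
vocabulary = sorted(w for w, n in total_counts.items() if n > 1)

print(total_counts["spam"], doc_counts["spam"])
print(vocabulary)
```

A list built this way could then be passed to TfidfVectorizer through its vocabulary parameter, in which case min_df is ignored, as the documentation quoted in the answer notes.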


1 Answer

You are looking for the min_df parameter (minimum document frequency). From the scikit-learn TfidfVectorizer documentation:

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

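from sklearn.feature_extraction.text import TfidfVectorizer
# remove words that appear in fewer than 5 documents
# (use min_df=2 to drop words that appear in only one document)
tfidf = TfidfVectorizer(min_df=5)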
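
For the case in the question (words seen in only one document), a minimal sketch with min_df=2 might look like the following. The three-sentence corpus is a made-up stand-in for df['original_post'].

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for df['original_post']
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Default min_df=1 keeps every token
full_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)

# min_df=2 keeps only tokens that appear in at least two documents
tfidf = TfidfVectorizer(min_df=2)
tfs = tfidf.fit_transform(docs)
kept_vocab = set(tfidf.vocabulary_)

# Tokens that were dropped by the cut-off
dropped = sorted(full_vocab - kept_vocab)

print(sorted(kept_vocab))
print(dropped)
print(tfs.shape)
```

Comparing the two vocabularies is a quick way to check how much a given cut-off shrinks the feature space before training anything on it.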

You can also remove common words:

# remove words occurring in more than half the documents
tfidf = TfidfVectorizer(max_df=0.5)

You can also remove stop words like this:

tfidf = TfidfVectorizer(stop_words='english')
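
These parameters can be combined in one vectorizer. The sketch below uses a made-up four-document corpus and assumes the built-in English stop word list is acceptable for the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang",
]

tfidf = TfidfVectorizer(
    min_df=2,              # drop tokens found in fewer than 2 documents
    max_df=0.5,            # drop tokens found in more than half the documents
    stop_words="english",  # drop built-in English stop words such as "the"
)
tfs = tfidf.fit_transform(docs)

# Remaining vocabulary after all three filters
remaining = sorted(tfidf.vocabulary_)
print(remaining)
```

Stop words are removed during tokenization, so the min_df and max_df thresholds are applied only to the tokens that survive that step.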

ShmulikA answered Sep 18 '22 13:09