I am attempting to remove words that occur only once in my corpus, to reduce the vocabulary size. I am using the sklearn TfidfVectorizer() and then the fit_transform method on my data frame column.
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))
My first thought is the preprocessor parameter of the TF-IDF vectorizer, or using a preprocessing step before the machine learning pipeline.
Any tips or links to further implementation?
ShmulikA's answer will most likely work well, but it removes words based on document frequency. Thus, if a specific word occurs 200 times in only 1 document, it will still be removed. TfidfVectorizer does not provide exactly what you want (a threshold on total occurrence counts).
The IDF of a word is the number of documents in the corpus divided by the number of documents containing that word: idf(t) = N / df(t). A more common word is supposed to be considered less significant, but plain division is too harsh, so the inverse document frequency is usually passed through a logarithm: idf(t) = log(N / df(t)). The key point is that df(t), and therefore min_df, counts documents rather than total occurrences, so a cut-off on raw term counts has to be built yourself.
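If you do want to drop words by their total occurrence count across the corpus, one option is to count terms first with a CountVectorizer and pass the surviving words as an explicit vocabulary to TfidfVectorizer. A minimal sketch, assuming df['original_post'] holds the text as in the question:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = df['original_post'].values.astype('U')

# total number of times each token appears across the whole corpus
cv = CountVectorizer()
term_counts = cv.fit_transform(docs).sum(axis=0).A1

# keep only tokens that occur more than once in total
vocab = [word for word, idx in cv.vocabulary_.items() if term_counts[idx] > 1]

# restrict the TF-IDF vocabulary to those tokens
tfidf = TfidfVectorizer(vocabulary=vocab)
tfs = tfidf.fit_transform(docs)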
You are looking for the min_df parameter (minimum document frequency). From the documentation of scikit-learn's TfidfVectorizer:
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
# remove words appearing in fewer than 5 documents
tfidf = TfidfVectorizer(min_df=5)
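As the docstring says, an integer is an absolute document count while a float is a proportion of documents:
# ignore words that appear in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)

# ignore words that appear in less than 1% of documents
tfidf = TfidfVectorizer(min_df=0.01)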
You can also remove overly common words with max_df:
# remove words occurring in more than half of the documents
tfidf = TfidfVectorizer(max_df=0.5)
You can also remove English stopwords like this:
tfidf = TfidfVectorizer(stop_words='english')
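These parameters can be combined in a single vectorizer, for example:
# drop rare words, very common words, and English stopwords at once
tfidf = TfidfVectorizer(min_df=5, max_df=0.5, stop_words='english')
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))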