I am attempting to remove words that occur only once in my corpus, to reduce the vocabulary size. I am using the sklearn TfidfVectorizer() and then the fit_transform method on my data frame column.
tfidf = TfidfVectorizer()
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))
My first thought is the preprocessor parameter of the TF-IDF vectorizer, or using a preprocessing step before the machine learning pipeline.
Any tips or links to further implementation?
ShmulikA's answer will most likely work well, but it removes words based on document frequency. Thus, if a specific word occurs 200 times in only 1 document, it will still be removed. TfidfVectorizer does not provide exactly what you want (a threshold on total occurrence counts).
The IDF of a word is the number of documents in the corpus divided by the number of documents containing that word: idf(t) = N / df(t). A more common word is supposed to be considered less significant, but plain division is too harsh, so the inverse document frequency is usually passed through a logarithm: idf(t) = log(N / df(t)). The key point is that df(t), and therefore min_df, counts documents rather than total occurrences, so a cut-off on raw term counts has to be built yourself.
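If you do want to drop words by their total occurrence count across the corpus, one option is to count terms first with a CountVectorizer and pass the surviving words as an explicit vocabulary to TfidfVectorizer. A minimal sketch, assuming df['original_post'] holds the text as in the question:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = df['original_post'].values.astype('U')

# total number of times each token appears across the whole corpus
cv = CountVectorizer()
term_counts = cv.fit_transform(docs).sum(axis=0).A1

# keep only tokens that occur more than once in total
vocab = [word for word, idx in cv.vocabulary_.items() if term_counts[idx] > 1]

# restrict the TF-IDF vocabulary to those tokens
tfidf = TfidfVectorizer(vocabulary=vocab)
tfs = tfidf.fit_transform(docs)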
You are looking for the min_df parameter (minimum document frequency). From the documentation of scikit-learn's TfidfVectorizer:
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
# remove words appearing in fewer than 5 documents
tfidf = TfidfVectorizer(min_df=5)
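As the docstring says, an integer is an absolute document count while a float is a proportion of documents:
# ignore words that appear in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)

# ignore words that appear in less than 1% of documents
tfidf = TfidfVectorizer(min_df=0.01)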
You can also remove overly common words with max_df:
# remove words occurring in more than half of the documents
tfidf = TfidfVectorizer(max_df=0.5)
You can also remove English stopwords like this:
tfidf = TfidfVectorizer(stop_words='english')
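These parameters can be combined in a single vectorizer, for example:
# drop rare words, very common words, and English stopwords at once
tfidf = TfidfVectorizer(min_df=5, max_df=0.5, stop_words='english')
tfs = tfidf.fit_transform(df['original_post'].values.astype('U'))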