I have been trying word2vec for a while now using the gensim's word2vec library. My question is do I have to remove stopwords from my input text? Because, based on my initial experimental results, I could see words like 'of', 'when'.. (stopwords) popping up when I do a <code>model.most_similar('someword')</code>..? But I didn't see anywhere referring that stop word removal is necessary with word2vec? Does the word2vec is supposed to handle stop words even if you don't remove them? What are the must do pre processing things (like for topic modeling, it's almost a must that you should do stopword removal)?

Gensim's implementation is based on the original Tomas Mikolov model of word2vec, then it downsamples all frequent words automatically based on frequency. As stated in the paper: <blockquote> We show that subsampling of frequent words during training results in a significant speedup (around 2x - 10x), and improves accuracy of the representations of less frequent words. </blockquote> What it means is that these words are sometimes not considered in the window of the words to be predicted. The sample parameter which defaults to 0.001 is used as a parameter to prune out those words. If you want to remove some specific stopwords which would not be removed based on its frequency, you can do that. Summary : The result would not make any significant difference if you do stop words removal.

stopword removing when using the word2vec

Tags:

I have been trying word2vec for a while now using the gensim's word2vec library. My question is do I have to remove stopwords from my input text? Because, based on my initial experimental results, I could see words like 'of', 'when'.. (stopwords) popping up when I do a model.most_similar('someword')..?

But I didn't see anywhere referring that stop word removal is necessary with word2vec? Does the word2vec is supposed to handle stop words even if you don't remove them?

What are the must do pre processing things (like for topic modeling, it's almost a must that you should do stopword removal)?

294

asked Jan 11 '16 12:01

samsamara

1 Answers

Gensim's implementation is based on the original Tomas Mikolov model of word2vec, then it downsamples all frequent words automatically based on frequency.

As stated in the paper:

We show that subsampling of frequent words during training results in a significant speedup (around 2x - 10x), and improves accuracy of the representations of less frequent words.

What it means is that these words are sometimes not considered in the window of the words to be predicted. The sample parameter which defaults to 0.001 is used as a parameter to prune out those words. If you want to remove some specific stopwords which would not be removed based on its frequency, you can do that.

Summary : The result would not make any significant difference if you do stop words removal.

111

answered Sep 30 '22 21:09

Trideep Rath

Related questions
                            
                                How to automatically reload modules in IPython?
                            
                                MySQL ALTER TABLE hangs
                            
                                Swift: error: use of instance member on type
                            
                                seaborn heatmap using pandas dataframe
                            
                                How can I install the latest Anaconda with wget
                            
                                Getting "ImportError: No Module named yaml" error
                            
                                Compile a single file under CMake project?
                            
                                BottomSheet fly away with visibility change
                            
                                How to find latest or most recent AWS RDS snapshot?
                            
                                How to add custom font sizes to QuillJS editor
                            
                                how do i backup a database in docker
                            
                                DataGrip: how to connect to Oracle as SYSDBA

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With