Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing step for many semantic similarity tasks. I was wondering whether anybody has experience lemmatizing the corpus before training word2vec, and whether this is a useful preprocessing step.
It depends on the task. Essentially, lemmatization collapses inflected forms into a single lemma, which shrinks the vocabulary and makes the data less sparse; that can help if you don't have enough training data. But since Word2Vec is usually trained on fairly big corpora, if you have enough training data, lemmatization shouldn't gain you much.
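For a concrete picture, here is a minimal sketch of the two preprocessing variants side by side. It assumes gensim's Word2Vec and NLTK's WordNetLemmatizer (neither library is specified in the question) and a tiny invented corpus; on a real corpus you would keep the default min_count and tune the usual hyperparameters.

```python
import nltk
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec

nltk.download("wordnet", quiet=True)  # resource needed by WordNetLemmatizer

# Toy corpus, invented purely for illustration.
corpus = [
    "the cats were chasing mice across the fields",
    "a cat chases a mouse across a field",
]

lemmatizer = WordNetLemmatizer()

# Variant 1: raw tokens, the way word2vec is usually trained.
raw_sentences = [s.split() for s in corpus]

# Variant 2: lemmatized tokens -- "cats"/"cat", "mice"/"mouse", "fields"/"field"
# collapse into single vocabulary entries, so each entry gets more examples.
lemmatized_sentences = [[lemmatizer.lemmatize(t) for t in s] for s in raw_sentences]

# min_count=1 only because this toy corpus is tiny.
raw_model = Word2Vec(raw_sentences, min_count=1)
lemma_model = Word2Vec(lemmatized_sentences, min_count=1)

print(len(raw_model.wv), "raw word types vs.", len(lemma_model.wv), "lemmatized types")
```

On a small corpus the smaller, lemmatized vocabulary tends to help; on a large corpus the raw model already sees each inflected form often enough.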
The experiments showed that while lemmatization is indeed not necessary for English, the situation is different for Russian. It seems that for rich-morphology languages, using lemmatized training and testing data yields small but consistent improvements, at least for word sense disambiguation.
In order to lemmatize, you need to create an instance of WordNetLemmatizer() and call its lemmatize() method on a single word. Let's lemmatize a simple sentence: we first tokenize the sentence into words using nltk.word_tokenize, and then call lemmatizer.lemmatize() on each token.
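A short sketch of those steps, assuming NLTK (the 'punkt' and 'wordnet' resources need to be downloaded once) and an arbitrary example sentence:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download("punkt"); nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

sentence = "The striped bats are hanging on their feet"
tokens = nltk.word_tokenize(sentence)               # ['The', 'striped', 'bats', ...]
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(lemmas)
# Without a POS tag, lemmatize() assumes nouns: 'bats' -> 'bat', 'feet' -> 'foot',
# but the verb form 'hanging' is left unchanged.
```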
Lemmatization usually involves using a vocabulary and morphological analysis of words, removing inflectional endings and returning the dictionary form of a word (the lemma). The morphological analysis is needed to extract the correct lemma for every word.
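To see why the morphological analysis matters, a WordNet-based lemmatizer returns different lemmas for the same surface form depending on the part of speech it is told to assume (again illustrated with NLTK's WordNetLemmatizer):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   (treated as an adjective)
print(lemmatizer.lemmatize("better"))            # 'better' (default POS is noun)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
print(lemmatizer.lemmatize("running"))           # 'running'
```

This is why lemmatization pipelines often run a POS tagger first; otherwise many verbs and adjectives are left in their inflected forms.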
Something more interesting is how to do tokenization with respect to the existing dictionary of word vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York.'] so that each token can be replaced with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases.
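One way to deal with this, sketched below with NLTK's MWETokenizer (gensim's Phrases module is another common option; neither is mentioned in the original answer), is to merge known multi-word expressions back into single tokens after ordinary tokenization. The phrase list here is hand-picked for illustration; in practice you would derive it from the vocabulary of your pretrained vectors.

```python
from nltk.tokenize import MWETokenizer, word_tokenize

# Merge known multi-word expressions into single tokens after tokenization.
mwe = MWETokenizer([("New", "York")], separator=" ")

tokens = word_tokenize("Good muffins cost $3.88 in New York.")
print(tokens)                # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']
print(mwe.tokenize(tokens))  # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York', '.']
```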