 

word2vec lemmatization of corpus before training

Word2vec seems to be mostly trained on raw corpus data. However, lemmatization is a standard preprocessing step for many semantic similarity tasks. I was wondering whether anybody has experience lemmatizing the corpus before training word2vec, and whether this is a useful preprocessing step.

asked May 26 '14 by Luca Fiaschi

People also ask


Do we need to use a lemmatizer for word embeddings?

The experiments showed that while lemmatization is indeed not necessary for English, the situation is different for Russian. It seems that for morphologically rich languages, using lemmatized training and testing data yields small but consistent improvements, at least for word sense disambiguation.

How do you lemmatize a corpus in Python?

In order to lemmatize, you create an instance of WordNetLemmatizer() and call its lemmatize() function on a single word. To lemmatize a simple sentence, first tokenize it into words using nltk.word_tokenize, then call lemmatizer.lemmatize() on each token.
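For instance, a minimal sketch with NLTK (assuming the 'punkt' and 'wordnet' NLTK data packages are installed; the sentence is just an illustration):

    # Minimal lemmatization sketch with NLTK.
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    sentence = "The striped bats were hanging on their feet"
    tokens = word_tokenize(sentence)

    # Without a POS tag, lemmatize() treats every word as a noun.
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    print(lemmas)
    # e.g. ['The', 'striped', 'bat', 'were', 'hanging', 'on', 'their', 'foot']

Note that with the default (noun) setting, verbs like "were" and "hanging" pass through unchanged; handling them needs the morphological analysis described next.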

How is lemmatization done?

Lemmatization usually involves a vocabulary and morphological analysis of words: inflectional endings are removed and the dictionary form of the word (the lemma) is returned. The morphological analysis is what lets you extract the correct lemma for each word.
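To illustrate why the morphological analysis matters, here is a hedged sketch that feeds part-of-speech tags into NLTK's WordNetLemmatizer (it assumes the 'averaged_perceptron_tagger' and 'wordnet' data are installed; penn_to_wordnet is a hypothetical helper, not part of NLTK):

    # Sketch: POS-aware lemmatization with NLTK.
    from nltk import pos_tag, word_tokenize
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def penn_to_wordnet(tag):
        # Hypothetical helper: map Penn Treebank tags to WordNet POS classes.
        if tag.startswith('J'):
            return wordnet.ADJ
        if tag.startswith('V'):
            return wordnet.VERB
        if tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN

    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize("The bats were hanging on their feet")
    lemmas = [lemmatizer.lemmatize(tok, penn_to_wordnet(tag))
              for tok, tag in pos_tag(tokens)]
    print(lemmas)
    # e.g. ['The', 'bat', 'be', 'hang', 'on', 'their', 'foot']

With the POS information, "were" correctly reduces to "be" and "hanging" to "hang".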


1 Answer

I think it really depends on what task you want to solve.

Essentially, lemmatization shrinks the vocabulary (the input space), which can help if you don't have enough training data.

But since Word2Vec is usually trained on large corpora, if you have enough training data, lemmatization shouldn't gain you much.
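As a concrete sketch of the two setups with gensim (the tiny corpus and the parameter values are illustrative assumptions, not recommendations):

    # Sketch: Word2Vec on a raw vs. lemmatized corpus, gensim 4.x API.
    from gensim.models import Word2Vec
    from nltk.stem import WordNetLemmatizer

    raw_corpus = [
        ["the", "cats", "were", "chasing", "mice"],
        ["a", "cat", "chases", "a", "mouse"],
    ]

    lemmatizer = WordNetLemmatizer()
    lemma_corpus = [[lemmatizer.lemmatize(w) for w in sent] for sent in raw_corpus]

    raw_model = Word2Vec(raw_corpus, vector_size=50, min_count=1, epochs=20)
    lemma_model = Word2Vec(lemma_corpus, vector_size=50, min_count=1, epochs=20)

    # The lemmatized model has fewer distinct types, so each type gets more
    # training examples -- the effect described above for small corpora.
    print(len(raw_model.wv), len(lemma_model.wv))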

Something more interesting is how to tokenize with respect to the existing dictionary of word vectors inside the W2V model (or anything else). For example, "Good muffins cost $3.88\nin New York." needs to be tokenized to ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York', '.'] so that each token can be replaced with its vector from W2V. The challenge is that some tokenizers may tokenize "New York" as ['New', 'York'], which doesn't make much sense. (For example, NLTK makes this mistake: https://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html) This is a problem when you have many multi-word phrases. One way to work around it is sketched below.
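One hedged workaround is NLTK's MWETokenizer, which re-merges known multi-word expressions after ordinary tokenization (the phrase list here is a toy assumption; in practice you would build it from the vocabulary of your pre-trained word vectors):

    # Sketch: re-merging multi-word phrases with NLTK's MWETokenizer.
    from nltk.tokenize import MWETokenizer, word_tokenize

    mwe = MWETokenizer([('New', 'York')], separator=' ')
    tokens = word_tokenize("Good muffins cost $3.88\nin New York.")
    # word_tokenize alone yields [..., 'New', 'York', '.']
    print(mwe.tokenize(tokens))
    # e.g. ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New York', '.']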

answered Sep 25 '22 by Daniel