 

Why doesn't word2vec use regularization?

ML models with a huge number of parameters tend to overfit (since they have high variance). In my opinion, word2vec is one such model. One way to reduce a model's variance is to apply a regularization technique, which is very common for other embedding models, such as matrix factorization. However, the basic version of word2vec doesn't have any regularization term. Is there a reason for this?
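
For concreteness, here is a toy numpy sketch (the names, sizes, and losses are my own illustration, not taken from any word2vec implementation) contrasting the skip-gram negative-sampling loss, which as usually written carries no penalty term, with a matrix-factorization loss that includes the explicit L2 regularization the question refers to:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                              # toy vocabulary size and embedding dimension
W_in = rng.normal(scale=0.1, size=(V, d))    # "input" word vectors
W_out = rng.normal(scale=0.1, size=(V, d))   # "output" (context) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    # Skip-gram negative-sampling loss for one (center, context) pair.
    # Note that no regularization term appears anywhere in this objective.
    pos = -np.log(sigmoid(W_in[center] @ W_out[context]))
    neg = -np.log(sigmoid(-W_in[center] @ W_out[negatives].T)).sum()
    return pos + neg

def mf_loss(cooc, lam=1e-3):
    # Squared-error matrix-factorization loss with an explicit L2 penalty,
    # the kind of regularization common in factorization-based embeddings.
    recon = W_in @ W_out.T
    return ((cooc - recon) ** 2).sum() + lam * ((W_in ** 2).sum() + (W_out ** 2).sum())

print(sgns_loss(3, 17, np.array([5, 9, 42])))
print(mf_loss(np.zeros((V, V))))
```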

asked Jan 15 '18 by Tural Gurbanov


People also ask

What is the problem with Word2Vec?

Perhaps the biggest problem with word2vec is the inability to handle unknown or out-of-vocabulary (OOV) words. If your model hasn't encountered a word before, it will have no idea how to interpret it or how to build a vector for it. You are then forced to use a random vector, which is far from ideal.
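
A minimal sketch of that OOV behaviour, assuming gensim 4.x (the tiny corpus and the fallback helper are made up for the example): a word the model never saw has no learned vector, so you end up substituting a random one.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = Word2Vec(sentences, vector_size=10, min_count=1, seed=1)

def vector_for(word, model):
    # Return the learned vector, or a random fallback for OOV words.
    if word in model.wv:
        return model.wv[word]
    # The model has never seen this word and has no vector for it;
    # falling back to a random vector is the far-from-ideal workaround.
    return np.random.default_rng(0).normal(size=model.wv.vector_size)

print(vector_for("cat", model)[:3])    # learned in-vocabulary vector
print(vector_for("zebra", model)[:3])  # random fallback for an OOV word
```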

Why is Word2Vec better than TF-IDF?

A key difference between TF-IDF and word2vec is that TF-IDF is a statistical measure that we can apply to terms in a document and then use to form a vector, whereas word2vec produces a vector for a term, and more work may then be needed to convert that set of vectors into a singular vector or other ...
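
A rough sketch of that difference, assuming scikit-learn and gensim (the two example documents are made up): TF-IDF yields one vector per document directly, while word2vec yields one dense vector per term and needs an extra aggregation step, here a simple average.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

docs = ["the cat sat on the mat", "the dog sat on the log"]

# TF-IDF: each document maps directly to a single (sparse) vector.
tfidf = TfidfVectorizer()
doc_vectors = tfidf.fit_transform(docs)              # shape: (n_docs, n_terms)

# word2vec: we get one dense vector per *term*; turning a document into a
# single vector needs an extra step, e.g. averaging its word vectors.
tokens = [d.split() for d in docs]
w2v = Word2Vec(tokens, vector_size=10, min_count=1, seed=1)
doc_embedding = np.mean([w2v.wv[t] for t in tokens[0]], axis=0)

print(doc_vectors.shape)    # (2, vocabulary size)
print(doc_embedding.shape)  # (10,)
```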

Does Word2Vec use deep learning?

No, Word2Vec is not a deep learning model. It can use continuous bag-of-words or continuous skip-gram as distributed representations, but in either case the number of parameters, layers, and non-linearities is too small for it to be considered a deep learning model.
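
To make "too small" concrete, here is a toy numpy sketch (sizes are arbitrary, my own illustration) of the whole skip-gram forward pass: one embedding lookup, one linear projection, and a softmax, i.e. roughly 2·V·d parameters and no stack of non-linear hidden layers.

```python
import numpy as np

V, d = 10_000, 100                     # toy vocabulary and embedding sizes
W_in = np.random.randn(V, d) * 0.01    # embedding (lookup) layer
W_out = np.random.randn(d, V) * 0.01   # output projection to vocabulary scores

def skipgram_forward(center_id):
    # The entire forward pass: lookup -> linear projection -> softmax.
    # There is no stack of hidden non-linear layers.
    h = W_in[center_id]                # embedding lookup
    scores = h @ W_out                 # single linear projection
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()         # softmax over context-word probabilities

print("parameter count:", W_in.size + W_out.size)  # 2 * V * d
print(skipgram_forward(42).shape)                  # (V,)
```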


1 Answer

That's an interesting question.

I'd say that overfitting in Word2Vec doesn't make a lot of sense, because the goal of word embeddings is to match the word co-occurrence distribution as exactly as possible. Word2Vec is not designed to learn anything outside of the training vocabulary, i.e., to generalize; it only has to approximate the one distribution defined by the text corpus. In this sense, Word2Vec is actually trying to fit the data exactly, so it can't over-fit.

If you had a small vocabulary, it'd be possible to compute the co-occurrence matrix directly and find the exact global minimum for the embeddings (of a given size), i.e., get the perfect fit, and that would define the best contextual word model for this fixed language.
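
As a toy illustration of that idea (my own example, using a squared-error factorization rather than word2vec's actual objective): a truncated SVD of a small co-occurrence matrix gives the globally optimal rank-d factorization, i.e. the "perfect fit" for that loss and embedding size.

```python
import numpy as np

# Toy symmetric co-occurrence counts for a 5-word vocabulary (made-up numbers).
C = np.array([
    [0, 4, 1, 0, 2],
    [4, 0, 3, 1, 0],
    [1, 3, 0, 2, 1],
    [0, 1, 2, 0, 3],
    [2, 0, 1, 3, 0],
], dtype=float)

d = 2                                   # embedding size
U, s, Vt = np.linalg.svd(C)

# Keeping the top-d singular directions gives the best possible rank-d
# approximation under squared error (Eckart-Young), i.e. the exact global
# minimum of that factorization objective for this fixed corpus.
word_vecs = U[:, :d] * np.sqrt(s[:d])
context_vecs = Vt[:d, :].T * np.sqrt(s[:d])

approx = word_vecs @ context_vecs.T
print("rank-%d reconstruction error: %.4f" % (d, np.linalg.norm(C - approx)))
```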

answered Oct 12 '22 by Maxim