
How does Fine-tuning Word Embeddings work?

I've been reading some NLP with Deep Learning papers and found that fine-tuning seems to be a simple yet confusing concept. A similar question has been asked here before, but it's still not quite clear to me.

Fine-tuning pre-trained word embeddings into task-specific word embeddings is mentioned in papers like Y. Kim, “Convolutional Neural Networks for Sentence Classification,” and K. S. Tai, R. Socher, and C. D. Manning, “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks,” but only briefly, without going into any details.

My question is:

Word embeddings generated with word2vec or GloVe are used as pretrained word vectors, i.e. as input features (X) for downstream tasks like parsing or sentiment analysis. In other words, those input vectors are plugged into a new neural network model for some specific task, and while training this new model we somehow end up with updated, task-specific word embeddings.

But as far as I know, during training what back-propagation does is update the weights (W) of the model; it does not change the input features (X). So how exactly do the original word embeddings get fine-tuned, and where do these fine-tuned vectors come from?
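For concreteness, here's roughly the setup I have in mind (a minimal PyTorch-style sketch; the sizes and names are made up for illustration): the pretrained vectors are looked up outside the model and passed in as fixed inputs, so back-propagation only ever touches the model's weights.

```python
# Illustrative sketch only: pretrained vectors are used as fixed input
# features X, so backprop updates the task model's weights W but never X.
import torch
import torch.nn as nn

# stand-in for a real word2vec / GloVe matrix: one 300-d vector per word id
pretrained = torch.randn(10000, 300)

token_ids = torch.tensor([[4, 17, 250]])   # a toy sentence as word indices
X = pretrained[token_ids]                  # fixed features, not a Parameter

model = nn.Linear(300, 2)                  # some downstream task model (its W)
logits = model(X.mean(dim=1))              # gradients flow to model.weight/bias only
```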

asked Oct 31 '16 by LingxB


1 Answer

Yes, if you feed the embedding vectors themselves as your input, you can't fine-tune the embeddings (at least not easily). However, all the frameworks provide some sort of EmbeddingLayer that takes as input an integer, the class ordinal of the word/character/other input token, and performs an embedding lookup. Such an embedding layer is very similar to a fully connected layer fed a one-hot-encoded class, but it is far more efficient, because it only needs to fetch/update one row of the matrix on both the forward and backward passes. More importantly, it allows the weights of the embedding to be learned.
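For example, a minimal PyTorch sketch of such a layer might look like this (the sizes and the `pretrained_vectors` tensor are placeholders, not something from the question):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 300
# stand-in for a (vocab_size, embed_dim) matrix loaded from word2vec / GloVe
pretrained_vectors = torch.randn(vocab_size, embed_dim)

# freeze=False keeps the rows trainable, i.e. they can be fine-tuned
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

token_ids = torch.tensor([[4, 17, 250]])   # integer word indices, not vectors
vectors = embedding(token_ids)             # lookup -> shape (1, 3, embed_dim)

# embedding.weight is an ordinary learnable parameter, so back-propagation
# updates exactly the rows that were looked up in the forward pass.
print(vectors.shape, embedding.weight.requires_grad)   # torch.Size([1, 3, 300]) True
```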

So the classic way is to feed the actual classes to the network instead of embeddings, and to prepend the entire network with an embedding layer that is initialized with word2vec / GloVe weights and continues learning them. It can also be reasonable to freeze the embeddings for several iterations at the beginning, until the rest of the network starts doing something reasonable with them, before you start fine-tuning them.
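A rough sketch of that freeze-then-fine-tune schedule (the model, the pooling, and the epoch at which unfreezing happens are all arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, pretrained_vectors, num_classes=2):
        super().__init__()
        # start with the pretrained rows frozen
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.fc = nn.Linear(pretrained_vectors.size(1), num_classes)

    def forward(self, token_ids):
        emb = self.embedding(token_ids)    # (batch, seq, dim)
        return self.fc(emb.mean(dim=1))    # crude average pooling

model = Classifier(torch.randn(10000, 300))
# the frozen embedding weight is already in the parameter list; it simply
# receives no gradient until it is unfrozen
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    if epoch == 3:  # once the rest of the network has settled down a bit
        model.embedding.weight.requires_grad_(True)   # unfreeze -> fine-tune
    # ... usual training loop: forward pass, loss, backward, optimizer.step() ...
```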

answered Oct 03 '22 by Ishamael