I am trying to understand the concept of embeddings for deep learning models.
I understand how employing word2vec can address the limitations of one-hot vectors.
However, I have recently seen a plethora of blog posts about ELMo, BERT, etc. that talk about contextual embeddings.
How are word embeddings different from contextual embeddings?
Contextual embedding is a technique used in Natural Language Processing (NLP). Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages.
The word embedding method made use of only the first 20 words, while the TF-IDF method made use of all available words. Therefore, the TF-IDF method gained more information from longer documents than the embedding method did. (7% of all documents are longer than 20 words.)
Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding: the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-Gram model.
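To make the difference between the two models concrete, here is a minimal sketch of how training examples could be built from a single sentence. The function name `training_examples`, the sample sentence, and the window size are illustrative assumptions, not part of word2vec itself. CBOW predicts the center word from its context, while skip-gram predicts each context word from the center word.

```python
def training_examples(tokens, window=2):
    """Build CBOW and skip-gram training examples from one tokenized sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        # Context = up to `window` tokens on each side of the center word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: predict the center word from its whole context.
        cbow.append((context, center))
        # Skip-gram: predict each context word from the center word.
        skipgram.extend((center, c) for c in context)
    return cbow, skipgram


# Hypothetical sample sentence, used only for illustration.
tokens = "the cat sat on the mat".split()
cbow, skipgram = training_examples(tokens, window=1)
print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

In both cases the "labels" come from the text itself, so no manual annotation is needed.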
Even though Word2Vec is an unsupervised model where you can give a corpus without any label information and the model can create dense word embeddings, Word2Vec internally leverages a supervised classification model to get these embeddings from the corpus.
Both embedding techniques, traditional word embedding (e.g. word2vec, GloVe) and contextual embedding (e.g. ELMo, BERT), aim to learn a continuous (vector) representation for each word in the documents. These continuous representations can then be used in downstream machine learning tasks.
Traditional word embedding techniques learn a global word embedding. They first build a global vocabulary from the unique words in the documents, ignoring the meaning of words in different contexts. Then, similar representations are learnt for words that frequently appear close to each other in the documents. The problem is that such word representations ignore the words' contextual meaning (the meaning derived from the words' surroundings). For example, only one representation is learnt for "left" in the sentence "I left my phone on the left side of the table." However, "left" has two different meanings in the sentence, and needs two different representations in the embedding space.
On the other hand, contextual embedding methods learn sequence-level semantics by considering the sequence of all words in the documents. Thus, such techniques learn different representations for polysemous words, e.g. "left" in the example above, based on their context.
Word embeddings and contextual embeddings are slightly different.
While both word embeddings and contextual embeddings are obtained from the models using unsupervised learning, there are some differences.
Word embeddings provided by word2vec or fastText come with a vocabulary (dictionary) of words. Each entry in this vocabulary pairs a word with its corresponding embedding. Hence, given a word, its embedding is always the same in whichever sentence it occurs. Here, the pre-trained word embeddings are static.
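A static embedding table behaves like a plain dictionary lookup. The sketch below uses made-up 3-dimensional vectors (real word2vec or fastText vectors are learned from a corpus and typically have hundreds of dimensions) and the hypothetical helper `embed_static`. It shows that both occurrences of "left" in one sentence receive exactly the same vector.

```python
# Hypothetical toy vectors, chosen by hand purely for illustration.
static_embeddings = {
    "i": [0.1, 0.3, 0.0],
    "left": [0.7, 0.2, 0.5],
    "my": [0.2, 0.1, 0.4],
    "phone": [0.9, 0.6, 0.1],
    "on": [0.0, 0.2, 0.2],
    "the": [0.1, 0.1, 0.1],
    "side": [0.6, 0.3, 0.7],
    "of": [0.0, 0.1, 0.3],
    "table": [0.8, 0.5, 0.2],
}


def embed_static(tokens, table):
    """Look up each token independently; the surrounding words play no role."""
    return [table[token] for token in tokens]


sentence = "i left my phone on the left side of the table".split()
vectors = embed_static(sentence, static_embeddings)

positions = [i for i, token in enumerate(sentence) if token == "left"]
print(positions)                 # [1, 6]
print(vectors[1] == vectors[6])  # True: one vector per word, regardless of meaning
```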
Contextual embeddings, however, are generally obtained from transformer-based models. The embeddings are obtained by passing the entire sentence to the pre-trained model. Note that there is still a vocabulary of words, but the vocabulary does not contain the contextual embeddings. The embedding generated for each word depends on the other words in the given sentence. (The other words in the sentence are referred to as the context. Transformer-based models rely on the attention mechanism, and attention is a way to look at the relation between a word and its neighbors.) Thus, a given word does not have a static embedding; its embedding is generated dynamically by the pre-trained (or fine-tuned) model.
For example, consider the two sentences:
Now, with pre-trained word embeddings such as word2vec, the embedding for the word 'point' is the same for both of its occurrences in example 1 and also the same for the word 'point' in example 2 (all three occurrences have the same embedding).
With embeddings from BERT, ELMo, or any similar contextual model, by contrast, the two occurrences of the word 'point' in example 1 will have different embeddings. Also, the word 'point' occurring in example 2 will have a different embedding than the ones in example 1.
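The idea of a context-dependent vector can be illustrated with a toy, single-head self-attention step in plain Python. This is only a sketch under strong assumptions and is not how BERT or ELMo are actually implemented: the 2-dimensional vectors and the two short sentences are made up, and the queries, keys, and values are the input vectors themselves, with no learned weights, no stacked layers, and no position information. Real models also encode word position, which is one reason repeated words within the same sentence can end up with different embeddings.

```python
import math

# Hypothetical static input vectors (what the model would start from).
table = {
    "i": [1.0, 0.0], "left": [0.5, 0.5], "my": [0.8, 0.2],
    "phone": [0.9, 0.1], "the": [0.0, 1.0], "side": [0.2, 0.8],
}


def self_attention(vectors):
    """Toy self-attention: each output is a softmax-weighted mix of all inputs."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product score between this token and every token.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the input vectors.
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs


sentence_a = "i left my phone".split()
sentence_b = "the left side".split()
ctx_a = self_attention([table[t] for t in sentence_a])
ctx_b = self_attention([table[t] for t in sentence_b])

# "left" is at index 1 in both sentences and starts from the same static vector,
# but its output differs because the neighbouring words differ.
print(ctx_a[1] != ctx_b[1])  # True
```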