
What are the differences between contextual embeddings and word embeddings?

I am trying to understand the concept of embeddings for deep learning models.

I understand how using word2vec can address the limitations of one-hot vectors.

However, recently I have seen a plethora of blog posts about ELMo, BERT, etc. that talk about contextual embeddings.

How are word embeddings different from contextual embeddings?

asked Jun 08 '20 at 22:06 by Exploring


People also ask

What is a contextual embedding?

Contextual embedding is a technique used in Natural Language Processing (NLP). Contextual embeddings assign each word a representation based on its context, thereby capturing uses of words across varied contexts and encoding knowledge that transfers across languages.

What is the difference between TF-IDF and word embeddings?

In the cited comparison, the word embedding method used only the first 20 words of each document, while the TF-IDF method used all available words. Therefore the TF-IDF method gained more information from longer documents than the embedding method (7% of the documents were longer than 20 words).

What are the two types of word embedding?

Two different learning models were introduced that can be used as part of the word2vec approach to learn word embeddings: the Continuous Bag-of-Words (CBOW) model and the Continuous Skip-Gram model.
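As a minimal sketch (assuming the gensim library, 4.x API; the tiny corpus and parameter values are illustrative only), the sg flag is what switches between the two training models:

    # Minimal sketch, assuming gensim 4.x; the toy corpus is illustrative only.
    from gensim.models import Word2Vec

    sentences = [
        ["i", "left", "my", "phone", "on", "the", "left", "side", "of", "the", "table"],
        ["where", "have", "you", "placed", "the", "point"],
    ]

    # sg=0 -> Continuous Bag-of-Words: predict a word from its surrounding context
    cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

    # sg=1 -> Continuous Skip-Gram: predict the surrounding context from a word
    skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    print(cbow.wv["phone"].shape, skipgram.wv["phone"].shape)  # (50,) (50,)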

What is the difference between word embedding and Word2Vec?

Even though Word2Vec is an unsupervised model where you can give a corpus without any label information and the model can create dense word embeddings, Word2Vec internally leverages a supervised classification model to get these embeddings from the corpus.


2 Answers

Both embedding techniques, traditional word embedding (e.g. word2vec, GloVe) and contextual embedding (e.g. ELMo, BERT), aim to learn a continuous (vector) representation for each word in the documents. Continuous representations can be used in downstream machine learning tasks.

Traditional word embedding techniques learn a global word embedding. They first build a global vocabulary from the unique words in the documents, ignoring the meaning of words in different contexts. Then, similar representations are learnt for words that frequently appear close to each other in the documents. The problem is that such word representations ignore the words' contextual meaning (the meaning derived from the words' surroundings). For example, only one representation is learnt for "left" in the sentence "I left my phone on the left side of the table." However, "left" has two different meanings in the sentence and needs to have two different representations in the embedding space.
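A minimal sketch of what "global" means in practice, assuming the gensim library and its downloadable pre-trained word2vec-google-news-300 vectors: the embedding is a static lookup table, so "left" maps to one and the same vector no matter which sentence it came from.

    # Minimal sketch, assuming gensim and its downloadable pre-trained vectors.
    import gensim.downloader as api

    wv = api.load("word2vec-google-news-300")  # a static word -> vector lookup table

    sentence = "I left my phone on the left side of the table."
    # Both occurrences of "left" resolve to the same row of the embedding matrix;
    # the surrounding words play no role at lookup time.
    v1 = wv["left"]
    v2 = wv["left"]
    print((v1 == v2).all())  # True: one global vector per vocabulary entry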

On the other hand, contextual embedding methods learn sequence-level semantics by considering the sequence of all words in the documents. Thus, such techniques learn different representations for polysemous words, e.g. "left" in the example above, based on their context.
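A minimal sketch of the contrast, assuming the Hugging Face transformers library and the pre-trained bert-base-uncased checkpoint: the two "left" tokens in the same sentence come out with different vectors because their contexts differ.

    # Minimal sketch, assuming Hugging Face transformers and bert-base-uncased.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    sentence = "I left my phone on the left side of the table."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    first, second = [i for i, t in enumerate(tokens) if t == "left"]

    # The two occurrences of "left" get different vectors, since their contexts differ.
    print(torch.allclose(hidden[first], hidden[second]))                          # False
    print(torch.cosine_similarity(hidden[first], hidden[second], dim=0).item())   # < 1.0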

answered Oct 21 '22 at 12:10 by Roohollah Etemadi


Word embeddings and contextual embeddings are slightly different.

While both word embeddings and contextual embeddings are obtained from models trained with unsupervised learning, there are some differences.

Word embeddings provided by word2vec or fastText come with a vocabulary (dictionary) of words. The elements of this vocabulary (or dictionary) are words and their corresponding word embeddings. Hence, given a word, its embedding is always the same in whichever sentence it occurs. Here, the pre-trained word embeddings are static.

However, contextual embeddings are generally obtained from transformer-based models. The embeddings are obtained by passing the entire sentence to the pre-trained model. Note that there is still a vocabulary of words here, but the vocabulary does not contain the contextual embeddings. The embedding generated for each word depends on the other words in the given sentence. (The other words in the sentence are referred to as the context. Transformer-based models work on the attention mechanism, and attention is a way to look at the relation between a word and its neighbours.) Thus, a given word does not have a static embedding; instead, the embeddings are dynamically generated from the pre-trained (or fine-tuned) model.

For example, consider the two sentences:

  1. I will show you a valid point of reference and talk to the point.
  2. Where have you placed the point?

Now, with the word embeddings from a pre-trained model such as word2vec, the embedding for the word 'point' is the same for both of its occurrences in example 1 and is also the same for the word 'point' in example 2 (all three occurrences have the same embedding).

In contrast, with the embeddings from BERT, ELMo, or any such contextual model, the two occurrences of the word 'point' in example 1 will have different embeddings. Also, the word 'point' occurring in example 2 will have a different embedding than the ones in example 1.
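To make this concrete, here is a minimal sketch (again assuming Hugging Face transformers and bert-base-uncased; the helper point_vectors is only for this illustration) that compares the three contextual vectors of 'point' from the two example sentences:

    # Minimal sketch, assuming Hugging Face transformers and bert-base-uncased.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def point_vectors(sentence):
        """Return the contextual vector of every 'point' token in the sentence."""
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
        return [hidden[i] for i, t in enumerate(tokens) if t == "point"]

    s1 = "I will show you a valid point of reference and talk to the point."
    s2 = "Where have you placed the point?"
    p1a, p1b = point_vectors(s1)   # two occurrences in example 1
    (p2,) = point_vectors(s2)      # one occurrence in example 2

    # All three vectors differ; a static word2vec lookup would give one shared vector.
    print(torch.cosine_similarity(p1a, p1b, dim=0).item())
    print(torch.cosine_similarity(p1a, p2, dim=0).item())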

answered Oct 21 '22 at 13:10 by Ashwin Geet D'Sa