 

What should the word vectors of the tokens <pad>, <unknown>, <go>, <EOS> be before they are sent into an RNN?

In word embeddings, what would be a good vector representation for the special tokens _PAD, _UNKNOWN, _GO, _EOS?

asked Jan 26 '17 by Wenchen Li



2 Answers

Spettekaka's answer works if you are updating your word embedding vectors as well.

Sometimes, though, you will want to use pretrained word vectors that you can't update. In that case, you can add a new dimension to your word vectors for each special token you want to add, and set each token's vector to 1 in its own new dimension and 0 everywhere else. That way, you won't run into a situation where, e.g., "<EOS>" is closer to the vector embedding of "start" than it is to the vector embedding of "end".

Example for clarification:

import numpy as np

# Assume vector_embedding is a dictionary and the word embeddings are 3-d before adding tokens,
# e.g. vector_embedding['NLP'] = np.array([0.2, 0.3, 0.4])

# Give each special token its own new dimension (one-hot in the added dimensions).
vector_embedding['<EOS>'] = np.array([0, 0, 0, 1])
vector_embedding['<PAD>'] = np.array([0, 0, 0, 0, 1])

new_vector_length = vector_embedding['<PAD>'].shape[0]  # length of the longest vector
for key, word_vector in vector_embedding.items():
    zero_append_length = new_vector_length - word_vector.shape[0]
    vector_embedding[key] = np.append(word_vector, np.zeros(zero_append_length))

Now your dictionary of word embeddings contains 2 new dimensions for the special tokens, and every original word vector has been zero-padded to the new length.
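If it helps, here is one way to turn the updated dictionary into an integer-ID mapping and an embedding matrix for a lookup layer. This is a minimal sketch; word_to_id, the stacking order, and the variable names are illustrative additions, not part of the answer above:

import numpy as np

# Assign each word (including the new special tokens) an integer ID.
word_to_id = {word: i for i, word in enumerate(vector_embedding)}

# Stack the now equal-length vectors so that row i is the vector for ID i.
embedding_matrix = np.stack([vector_embedding[word] for word in word_to_id])

print(embedding_matrix.shape)  # (vocab_size, 5) for the 3-d example above with two added tokens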

answered Oct 18 '22 by TrentWoodbury


As far as I understand, you can represent these tokens with any vector.

Here's why:

When you input a sequence of words to your model, you first convert each word to an ID and then look up the vector corresponding to that ID in your embedding matrix. You train your model with that vector. But the embedding matrix itself is just a set of trainable weights that will be adjusted during training; the pretrained vectors merely serve as a good starting point for reaching good results.

Thus, it doesn't matter much how your special tokens are represented at the beginning, as their representations will change during training.
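To make that concrete, here is a minimal sketch of this idea, assuming PyTorch (which neither the question nor this answer mentions) and made-up shapes: the pretrained rows and randomly initialized special-token rows go into one trainable embedding matrix, so the special-token representations get adjusted during training like everything else.

import numpy as np
import torch
import torch.nn as nn

# Stand-ins for real pretrained vectors: a vocabulary of 10,000 words with 300-d embeddings.
pretrained = np.random.rand(10000, 300).astype(np.float32)

# Randomly initialized rows for <PAD>, <UNKNOWN>, <GO>, <EOS>.
special_tokens = np.random.uniform(-0.05, 0.05, size=(4, 300)).astype(np.float32)

# One trainable embedding matrix; freeze=False keeps every row adjustable during training.
weights = torch.from_numpy(np.vstack([pretrained, special_tokens]))
embedding = nn.Embedding.from_pretrained(weights, freeze=False)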

answered Oct 18 '22 by spettekaka