Suppose we're training a neural network model to learn the mapping from the following input to output, where the output is the sequence of named entity (NE) tags.
Input: EU rejects German call to boycott British lamb .
Output: ORG O MISC O O O MISC O O
A sliding window is created to capture context information, and its output is fed into the model as model_input. The sliding window produces the following results:
[['<s>', '<s>', 'EU', 'rejects', 'German'],
 ['<s>', 'EU', 'rejects', 'German', 'call'],
 ['EU', 'rejects', 'German', 'call', 'to'],
 ['rejects', 'German', 'call', 'to', 'boycott'],
 ['German', 'call', 'to', 'boycott', 'British'],
 ['call', 'to', 'boycott', 'British', 'lamb'],
 ['to', 'boycott', 'British', 'lamb', '.'],
 ['boycott', 'British', 'lamb', '.', '</s>'],
 ['British', 'lamb', '.', '</s>', '</s>']]
<s> represents the start-of-sentence token and </s> represents the end-of-sentence token, and every sliding window corresponds to one NE tag in the output.
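For reference, here is a minimal sketch of one way such windows could be generated; the function name and padding scheme are only illustrative, not the exact code used:

```python
def sliding_windows(tokens, size=5, pad_left='<s>', pad_right='</s>'):
    """Pad the sentence and emit one window of `size` tokens per original token."""
    half = size // 2
    padded = [pad_left] * half + tokens + [pad_right] * half
    return [padded[i:i + size] for i in range(len(tokens))]

sentence = 'EU rejects German call to boycott British lamb .'.split()
for window in sliding_windows(sentence):
    print(window)  # 9 windows, one per NE tag in the output
```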
To process these tokens, a pre-trained embedding model (e.g., GloVe) is used to convert words to vectors, but such pre-trained models do not include tokens like <s> and </s>. I don't think random initialization for <s> and </s> is a good idea here, because the scale of the random vectors might not be consistent with the other GloVe embeddings.
Question:
What do you suggest for setting up the embeddings for <s> and </s>, and why?
In general, the answer depends on how you intend to use the embeddings in your task.
I suspect that the use of <s> and </s> tokens is dictated by an LSTM or other recurrent neural network that follows the embedding layer. If you were training the word embeddings themselves, I'd suggest simply getting rid of these tokens, because they don't add any value. Start and stop tokens do matter for an LSTM (though not always), but their word embeddings can be arbitrary; small random numbers will do fine, because such a vector would be roughly equally far from all "normal" vectors.
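For example, a minimal sketch of what "small random numbers" could look like, assuming the pre-trained vectors are held in a dict called glove (a hypothetical variable name):

```python
import numpy as np

# `glove` is assumed to map words to d-dimensional numpy vectors.
dim = len(next(iter(glove.values())))

rng = np.random.default_rng(0)
# Small random vectors: roughly equidistant from all "normal" word vectors,
# which is enough for the network to tell the boundary tokens apart.
glove['<s>'] = rng.normal(scale=0.1, size=dim)
glove['</s>'] = rng.normal(scale=0.1, size=dim)
```

If the scale still worries you, you could draw the random values with a standard deviation matched to the GloVe matrix itself (e.g., np.std over the stacked vectors).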
If you don't want to mess with the pre-trained GloVe vectors, I would suggest freezing the embedding layer. For example, in TensorFlow this can be achieved with the tf.stop_gradient op right after the embedding lookup. This way the network won't learn any relation between <s> and other words, but that's totally fine, and the existing relations won't change.
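A minimal sketch of that idea (embedding_matrix and window_ids are placeholders for your own data, not names from any library):

```python
import tensorflow as tf

# Pre-trained GloVe matrix plus the extra rows for <s> and </s>.
embeddings = tf.Variable(embedding_matrix, dtype=tf.float32, name='embeddings')

# Look up the vectors for one sliding window of token ids.
window_vectors = tf.nn.embedding_lookup(embeddings, window_ids)

# stop_gradient blocks backpropagation into the embedding table, so the
# pre-trained vectors (and the random <s>/</s> vectors) stay as initialized.
window_vectors = tf.stop_gradient(window_vectors)
```

In Keras you can get the same effect by passing trainable=False to the Embedding layer.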