Train only some word embeddings (Keras)

In my model I use GloVe pre-trained embeddings. I wish to keep them non-trainable in order to reduce the number of model parameters and avoid overfitting. However, I have a special symbol whose embedding I do want to train.

With the provided Embedding layer, the 'trainable' parameter can only set the trainability of all embeddings at once:

from keras.layers import Embedding

embedding_layer = Embedding(voc_size,
                            emb_dim,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN,
                            trainable=False)
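
For context, embedding_matrix is the usual GloVe lookup table. A minimal sketch of how it might be built, assuming a glove.6B.100d.txt file and a word_index dict from a fitted Keras Tokenizer (both are assumptions, not shown in the question):

import numpy as np

emb_dim = 100

# Parse the GloVe text file into a word -> vector dict (file path is an assumption)
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# word_index is assumed to come from a fitted Keras Tokenizer
voc_size = len(word_index) + 1
embedding_matrix = np.zeros((voc_size, emb_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words without a GloVe vector keep a zero row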

Is there a Keras-level solution to training only a subset of embeddings?

Please note:

  1. There is not enough data to generate new embeddings for all words.
  2. The answers I have found so far only cover native TensorFlow, not Keras.
asked Feb 27 '18 by miclat


2 Answers

I found a nice workaround, inspired by Keith's idea of using two embedding layers.

Main idea:

Assign the special tokens (and the OOV token) the highest IDs. Generate a 'sentence' containing only the special tokens, zero-padded elsewhere. Then apply the non-trainable embeddings to the 'normal' sentence and the trainable embeddings to the special tokens, and finally add the two outputs.

Works fine for me.

    import numpy as np
    from keras.layers import Embedding, Input, Lambda, Activation, Add

    # Normal embs - '+2' for empty token and OOV token
    embedding_matrix = np.zeros((vocab_len + 2, emb_dim))
    # Special embs
    special_embedding_matrix = np.zeros((special_tokens_len + 2, emb_dim))

    # Here we may apply pre-trained embeddings to embedding_matrix

    embedding_layer = Embedding(vocab_len + 2,
                                emb_dim,
                                mask_zero=True,
                                weights=[embedding_matrix],
                                input_length=MAX_SENT_LEN,
                                trainable=False)

    special_embedding_layer = Embedding(special_tokens_len + 2,
                                        emb_dim,
                                        mask_zero=True,
                                        weights=[special_embedding_matrix],
                                        input_length=MAX_SENT_LEN,
                                        trainable=True)

    valid_words = vocab_len - special_tokens_len

    sentence_input = Input(shape=(MAX_SENT_LEN,), dtype='int32')

    # Shift IDs so only special tokens stay positive, e.g.: [0,0,1,0,3,0,0];
    # the relu clips normal-word IDs (now negative) to 0
    special_tokens_input = Lambda(lambda x: x - valid_words)(sentence_input)
    special_tokens_input = Activation('relu')(special_tokens_input)

    # Apply both 'normal' embeddings and special-token embeddings
    embedded_sequences = embedding_layer(sentence_input)
    embedded_special = special_embedding_layer(special_tokens_input)

    # Sum the two embedding outputs element-wise
    embedded_sequences = Add()([embedded_sequences, embedded_special])
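
A quick way to check that only the special embeddings stay trainable is to wrap the graph in a throwaway Model and inspect the parameter counts; a minimal sketch, assuming the variables defined above:

    from keras.models import Model

    check_model = Model(inputs=sentence_input, outputs=embedded_sequences)
    check_model.summary()
    # Trainable params should be (special_tokens_len + 2) * emb_dim;
    # every weight loaded into embedding_matrix stays frozen.
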
answered Oct 12 '22 by miclat

I haven't found a nice solution like a mask for the Embedding layer, but here's what I've been meaning to try:

  • Two embedding layers - one trainable and one not
  • The non-trainable one has all the GloVe embeddings for in-vocab words and zero vectors for the others
  • The trainable one only maps the OOV words and special symbols
  • The outputs of these two layers are added (I was thinking of this like ResNet)
  • The Conv/LSTM/etc. below the embedding is unchanged

That would get you a solution with only a small number of free parameters allocated to those embeddings; a sketch of the idea follows.
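
A minimal sketch of that idea (not Keith's exact code): it assumes the in-vocab words take the low IDs and the OOV words / special symbols take the IDs from trainable_start upwards; the sizes and the gating Lambda are illustrative assumptions.

    import numpy as np
    from keras import backend as K
    from keras.layers import Input, Embedding, Lambda, Add
    from keras.models import Model

    vocab_size = 20000        # total number of IDs (assumed)
    trainable_start = 19900   # IDs >= trainable_start are OOV words / special symbols (assumed)
    num_special = vocab_size - trainable_start
    emb_dim = 100
    MAX_LEN = 50

    # Non-trainable table: GloVe rows for in-vocab words, zero rows for the rest
    glove_matrix = np.zeros((vocab_size, emb_dim))  # fill in GloVe vectors here
    fixed_emb = Embedding(vocab_size, emb_dim, weights=[glove_matrix],
                          input_length=MAX_LEN, trainable=False)

    # Trainable table only for the OOV/special IDs ('+1' for the unused 0 slot)
    train_emb = Embedding(num_special + 1, emb_dim,
                          input_length=MAX_LEN, trainable=True)

    ids = Input(shape=(MAX_LEN,), dtype='int32')

    # Remap IDs so special tokens become 1..num_special and normal words become 0
    special_ids = Lambda(lambda x: K.relu(x - (trainable_start - 1)))(ids)
    # 0/1 indicator so normal-word positions get nothing from the trainable table
    is_special = Lambda(lambda x: K.expand_dims(
        K.cast(K.greater_equal(x, trainable_start), 'float32'), -1))(ids)

    gated_special = Lambda(lambda t: t[0] * t[1])([train_emb(special_ids), is_special])
    embedded = Add()([fixed_emb(ids), gated_special])

    model = Model(ids, embedded)  # put the Conv/LSTM/etc. on top of `embedded` as usual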

answered Oct 12 '22 by Keith