What embedding-layer output_dim is really needed for a dictionary of just 10000 words?

I'm training an RNN with a very reduced set of word features, around 10,000. I was planning on starting with an embedding layer before adding RNNs, but it is very unclear to me what dimensionality is really needed. I know that I can try out different values (32, 64, etc.), but I'd rather have some intuition going into it first. For example, if I use a 32-dimensional embedding vector, then only 3 different values are needed per dimension to fully describe the space (3**32 >> 10000).

Alternatively, for a space with this small number of words, does one even really need to use an embedding layer or does it make more sense to just go from an input layer right to the RNN?

asked Jul 13 '18 by AstroBen

People also ask

What is Output_dim in embedding layer?

output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger.
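For instance, a minimal sketch (with illustrative sizes, using the Keras API) of an embedding layer whose output_dim is 32 for a 10,000-word vocabulary:

from tensorflow.keras.layers import Embedding

# Each of the 10,000 word indices is mapped to a 32-dimensional dense vector.
embedding_layer = Embedding(input_dim=10000,  # vocabulary size
                            output_dim=32)    # size of each word vector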

How do I choose an embed size?

If we're in a hurry, one rule of thumb is to use the fourth root of the total number of unique categorical elements, while another is that the embedding dimension should be approximately 1.6 times the square root of the number of unique elements in the category, and no more than 600.
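As a quick, non-authoritative illustration, here is what both heuristics give for a 10,000-word vocabulary:

# Both rules of thumb evaluated for 10,000 unique words (illustrative only).
n = 10000
fourth_root_rule = round(n ** 0.25)            # -> 10
sqrt_rule = min(600, round(1.6 * n ** 0.5))    # 1.6 * 100 = 160, capped at 600
print(fourth_root_rule, sqrt_rule)             # 10 160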

What is word embedding layer?

Description. A word embedding layer maps word indices to vectors. Use a word embedding layer in a deep learning long short-term memory (LSTM) network. An LSTM network is a type of recurrent neural network (RNN) that can learn long-term dependencies between time steps of sequence data.
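A minimal sketch of that Embedding-then-LSTM pattern in Keras (layer sizes here are illustrative, not prescriptive):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),  # word indices -> dense vectors
    LSTM(64),                                   # recurrent layer over the embedded sequence
    Dense(1, activation='sigmoid')              # e.g. a binary classification head
])
model.compile(optimizer='adam', loss='binary_crossentropy')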

What is embedding layer in CNN?

An embedding layer enables us to convert each word into a fixed-length vector of a defined size. The resulting vector is dense, with real values instead of just 0's and 1's. The fixed length of the word vectors lets us represent words more expressively while keeping the dimensionality low.
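To see that dense output concretely, here is a small sketch (toy sizes) that passes a few word indices through an untrained embedding layer:

import numpy as np
from tensorflow.keras.layers import Embedding

layer = Embedding(input_dim=10000, output_dim=8)   # toy vocabulary and vector size
word_indices = np.array([[4, 20, 37]])             # one sequence of three word ids
vectors = layer(word_indices)                      # shape (1, 3, 8), real-valued, not one-hot
print(vectors.shape)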


1 Answer

This is a good question that does not have a definitive answer. You should certainly use an embedding layer and not just go straight to an LSTM/GRU. However, the latent dimension of the embedding layer should be "as large as possible while maintaining peak validation performance". For a dictionary around your size, 128 or 256 should be a reasonable choice. I doubt you will see drastically different performance.
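If you want to check that empirically, a hedged sketch of such a sweep (with dummy data standing in for your own integer-encoded, padded sequences) could look like this:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

# Dummy data purely for illustration: 256 training / 64 validation sequences of length 50.
x_train = np.random.randint(1, 10000, size=(256, 50))
y_train = np.random.randint(0, 2, size=(256,))
x_val = np.random.randint(1, 10000, size=(64, 50))
y_val = np.random.randint(0, 2, size=(64,))

for dim in (32, 64, 128, 256):
    model = Sequential([
        Embedding(input_dim=10000, output_dim=dim),
        GRU(128),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=3, verbose=0)
    print(dim, max(history.history['val_accuracy']))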

However, something that will really hurt your results on a small data set is not using pre-trained word embeddings. Without them, your embeddings will brutally overfit to your training data. I recommend using GloVe word embeddings. After downloading the GloVe data, you can use it to initialize the weights of your embedding layer, and the embedding layer will then fine-tune the weights to your use case. Here is some code I use for the GloVe embeddings with Keras. It lets you load different embedding sizes and also caches the matrix so that it is fast to run a second time.

import os
from enum import Enum

import numpy as np


class GloVeSize(Enum):
    # available dimensionalities of the pre-trained GloVe 6B vectors
    tiny = 50
    small = 100
    medium = 200
    large = 300


__DEFAULT_SIZE = GloVeSize.small


def get_pretrained_embedding_matrix(word_to_index,
                                    vocab_size=10000,
                                    glove_dir="./bin/GloVe",
                                    use_cache_if_present=True,
                                    cache_if_computed=True,
                                    cache_dir='./bin/cache',
                                    size=__DEFAULT_SIZE,
                                    verbose=1):

    """
    get pre-trained word embeddings from GloVe: https://github.com/stanfordnlp/GloVe
    :param word_to_index: a word to index map of the corpus
    :param vocab_size: the vocab size
    :param glove_dir: the dir of glove
    :param use_cache_if_present: whether to use a cached weight file if present
    :param cache_if_computed: whether to cache the result if re-computed
    :param cache_dir: the directory of the project's cache
    :param size: an enumerated choice of GloVeSize
    :param verbose: the verbosity level of logging
    :return: a matrix of the embeddings
    """
    def vprint(*args, with_arrow=True):
        if verbose > 0:
            if with_arrow:
                print(">>", *args)
            else:
                print(*args)

    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)

    cache_path = os.path.join(cache_dir, 'glove_%d_embedding_matrix.npy' % size.value)
    if use_cache_if_present and os.path.isfile(cache_path):
        return np.load(cache_path)
    else:
        vprint('computing embeddings', with_arrow=True)
        embeddings_index = {}
        size_value = size.value
        f = open(os.path.join(glove_dir, 'glove.6B.' + str(size_value) + 'd.txt'),
                 encoding="utf8")

        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs

        f.close()
        vprint('Found', len(embeddings_index), 'word vectors.')

        embedding_matrix = np.random.normal(size=(vocab_size, size.value))

        non = 0
        for word, index in word_to_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[index] = embedding_vector
            else:
                non += 1

        vprint(non, "words did not have mappings")
        vprint(with_arrow=False)

        if cache_if_computed:
            np.save(cache_path, embedding_matrix)

    return embedding_matrix

then instantiate your embedding layer with that weight matrix:

embedding_size = GloVeSize.small
embedding_matrix = get_pretrained_embedding_matrix(data.word_to_index,
                                                   size=embedding_size)

embedding = Embedding(
    output_dim=embedding_size.value,
    input_dim=vocabulary_size + 1,   # +1 for the masked padding index 0
    input_length=input_length,
    mask_zero=True,
    weights=[np.vstack((np.zeros((1, embedding_size.value)),  # zero row for index 0
                        embedding_matrix))],
    name='embedding'
)(input_layer)
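From there, a hedged sketch of wiring that embedding into an RNN with the functional API (assuming input_layer is the Input(shape=(input_length,), dtype='int32') tensor the embedding above was applied to; the LSTM width and sigmoid head are placeholder choices):

from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Model

x = LSTM(128)(embedding)                    # recurrent layer over the pre-trained embeddings
output = Dense(1, activation='sigmoid')(x)  # e.g. a binary classification head
model = Model(inputs=input_layer, outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])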
answered Nov 15 '22 by modesitt