I am starting to use Keras to build neural network models.
I have a classification problem where the features are discrete. To handle this case, the standard procedure consists of converting the discrete features into binary arrays with a one-hot encoding.
However, it seems that with Keras this step is not necessary, since one can simply use an Embedding layer to create a feature-vector representation of these discrete features.
How are these embeddings performed?
My understanding is that, if the discrete feature f can take k values, then an Embedding layer creates a matrix with k columns. Every time I receive a value for that feature, say i, during the training phase, only the i-th column of the matrix will be updated.
Is my understanding correct?
The Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
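For example, a minimal sketch (the vocabulary size, vector dimension, and indices are made up for illustration) showing those dimensions with the TensorFlow/Keras API:

import numpy as np
import tensorflow as tf

embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)  # 1000-word vocabulary, 8-dim vectors
batch = np.array([[4, 25, 7], [83, 2, 9]])                           # (batch=2, sequence=3) word indices
vectors = embedding(batch)
print(vectors.shape)                                                 # (2, 3, 8) -> (batch, sequence, embedding)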
Keras Embedding Layer. Keras offers an Embedding layer that can be used for neural networks on text data. It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
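A minimal sketch of that integer-encoding step with the Tokenizer API (the example texts are invented purely for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer

texts = ['the cat sat', 'the dog barked']
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)               # build the word -> integer vocabulary
print(tokenizer.word_index)                 # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'barked': 5}
print(tokenizer.texts_to_sequences(texts))  # [[1, 2, 3], [1, 4, 5]]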
An embedding layer is faster because it is essentially the equivalent of a dense layer that makes simplifying assumptions: since its input is a one-hot vector, it can simply look up the corresponding row of the weight matrix. A Dense layer would treat those inputs as actual values and perform the full matrix multiplication.
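A small NumPy sketch of that simplifying assumption (the matrix and index are arbitrary): multiplying a one-hot vector by a weight matrix only selects one row, so a lookup gives the same result without the full multiplication.

import numpy as np

vocab_size, dim = 5, 3
W = np.random.rand(vocab_size, dim)  # weight matrix shared by both views
i = 2                                # integer-encoded word

one_hot = np.zeros(vocab_size)
one_hot[i] = 1.0
dense_style = one_hot @ W            # what a Dense layer over one-hot input computes
lookup_style = W[i]                  # what an Embedding layer effectively does

assert np.allclose(dense_style, lookup_style)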
Most of the time when you use embeddings, you'll use them already trained and available - you won't be training them yourself. However, to understand what they are better, we'll mock up a dataset based on colour combinations, and learn the embeddings to turn a colour name into a location in both 2D and 3D space.
Suppose you have N objects that do not directly have a mathematical representation, for example words.
Since neural networks can only work with tensors, you need some way to translate those objects into tensors. The solution is a giant matrix (the embedding matrix) that relates each object's index to its tensor representation:
object_index_1: vector_1
object_index_2: vector_2
...
object_index_n: vector_n
Selecting the vector of a specific object can be expressed as a matrix product M · v,
where v is the one-hot vector that determines which word needs to be translated, and M is the embedding matrix.
The usual pipeline would be the following:
import numpy as np

objects = ['cat', 'dog', 'snake', 'dog', 'mouse', 'cat', 'dog', 'snake', 'dog']
unique = ['cat', 'dog', 'snake', 'mouse']  # the unique objects, e.g. list(dict.fromkeys(objects))
objects_index = [0, 1, 2, 1, 3, 0, 1, 2, 1]  # list(map(unique.index, objects))
objects_one_hot = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0],
                   [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
                   [0, 1, 0, 0]]  # [[int(i == x) for i in range(len(unique))] for x in objects_index]
# objects_one_hot is a 9x4 matrix: one row per object, one column per unique object.
# M is a dim x 4 matrix, where dim is the number of dimensions you want the vectors to have.
# In this case dim = 2.
M = np.array([[1, 1], [1, 2], [2, 2], [3, 3]]).T  # or... np.random.rand(2, 4)
# objects_vectors = M * objects_one_hot^T: each object is mapped to a column of M.
objects_vectors = [[1, 1], [1, 2], [2, 2], [1, 2],
                   [3, 3], [1, 1], [1, 2], [2, 2],
                   [1, 2]]  # M.dot(np.array(objects_one_hot).T).T
Normally the embedding matrix is learned together with the rest of the model, so that the vectors adapt to represent each object as well as possible. With that, we already have a mathematical representation of the objects!
As you have seen, we used a one-hot encoding and then a matrix product. What you really do is take the column of M that represents that word.
During training, M is adapted to improve the representation of the objects, and as a consequence the loss goes down (see the sketch below).
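A minimal, hypothetical sketch of that training step in Keras. The binary labels are invented just to give the optimizer something to fit, and note that Keras stores the embedding matrix as 4x2 (one row per object) rather than as the 2x4 column layout used above.

import numpy as np
import tensorflow as tf

objects_index = np.array([0, 1, 2, 1, 3, 0, 1, 2, 1]).reshape(-1, 1)  # cat, dog, snake, dog, ...
labels = np.array([0, 1, 1, 1, 0, 0, 1, 1, 1])                        # made-up binary targets

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=4, output_dim=2, input_length=1),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(objects_index, labels, epochs=5, verbose=0)

M_learned = model.layers[0].get_weights()[0]  # the learned 4x2 embedding matrix
print(M_learned)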
The Embedding layer in Keras (and in general) is a way to create a dense word encoding. You should think of it as a matrix multiplied by a one-hot-encoding (OHE) matrix, or simply as a linear layer over an OHE matrix.
It is always used as a layer attached directly to the input.
Sparse and dense word encodings denote the efficiency of the encoding.
One-hot encoding (OHE) is a sparse word-encoding model: for example, a vocabulary of 1000 words requires 1000-dimensional OHE vectors, with one dimension per word.
Let's say we know that some input activations are dependent, and that 64 latent features are enough. We would have this embedding:
e = Embedding(1000, 64, input_length=50)
1000 says we plan to encode 1000 words in total, 64 says we use a 64-dimensional vector space, and 50 says each input document has 50 words.
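A quick sketch to check the resulting shape (the random indices below are just placeholders for integer-encoded documents):

import numpy as np
import tensorflow as tf

e = tf.keras.layers.Embedding(1000, 64, input_length=50)
docs = np.random.randint(0, 1000, size=(32, 50))  # 32 documents, 50 word indices each
print(e(docs).shape)                              # (32, 50, 64): one 64-dim vector per word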
The Embedding layer is initialized with random non-zero values, and its parameters need to be learned during training.
There are other parameters that can be set when creating the Embedding layer; see the Keras documentation and the sketch below.
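For instance, a sketch of a few of those parameters (the values shown are defaults or illustrative choices, not requirements):

from tensorflow.keras.layers import Embedding

e = Embedding(
    input_dim=1000,                    # vocabulary size
    output_dim=64,                     # dimensionality of the embedding vectors
    embeddings_initializer='uniform',  # how the initial random values are drawn
    mask_zero=False,                   # if True, index 0 is treated as padding and masked out
    input_length=50,                   # length of the input sequences
)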
What is the output from the Embedding layer?
For each input document, the output of the Embedding layer is a 2D matrix with one embedding vector for each word in the input sequence of words.
NOTE: If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
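A minimal sketch of that NOTE (the sizes are hypothetical): Flatten turns the (50, 64) output of the Embedding layer into a single 3200-dimensional vector before the Dense layer.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 64, input_length=50),
    tf.keras.layers.Flatten(),                       # (50, 64) -> (3200,)
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
docs = np.random.randint(0, 1000, size=(8, 50))      # 8 integer-encoded documents
print(model(docs).shape)                             # (8, 1)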