I am a newbie at implementing language models with Keras RNN structures. I have a dataset of discrete words (not taken from a single paragraph) with the following statistics.
Now, I want to build a model that accepts a character and predicts the next character in the word. I have padded all the words so that they have the same length, so my input Word_input has shape 1953 x 9 and the target has shape 1953 x 9 x 33. I also want to use an Embedding layer. My network architecture is:
self.wordmodel=Sequential()
self.wordmodel.add(Embedding(33,embedding_size,input_length=9))
self.wordmodel.add(LSTM(128, return_sequences=True))
self.wordmodel.add(TimeDistributed(Dense(33)))
self.wordmodel.compile(loss='mse',optimizer='rmsprop',metrics=['accuracy'])
As an example, the word "CAT" with padding is represented as:
Input to network -- START C A T END * * * * (9 characters)
Target of the same --- C A T END * * * * * (9 characters)
So with the TimeDistributed output I am measuring the difference between the network's prediction and the target. I have also set the batch_size to 1, so that after reading each sample word the network resets its state.
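For reference, here is a minimal sketch of how I build one input/target pair (the character-to-index mapping below is only illustrative; my real vocabulary has 33 symbols):

import numpy as np

char2idx = {'*': 0, 'START': 1, 'END': 2, 'C': 3, 'A': 4, 'T': 5}  # illustrative mapping

input_seq = ['START', 'C', 'A', 'T', 'END', '*', '*', '*', '*']   # 9 characters
target_seq = ['C', 'A', 'T', 'END', '*', '*', '*', '*', '*']      # shifted by one step

x = np.array([char2idx[c] for c in input_seq])   # shape (9,)
y = np.zeros((9, 33))                            # one-hot targets, shape (9, 33)
for t, c in enumerate(target_seq):
    y[t, char2idx[c]] = 1.0

# Stacking all 1953 words gives Word_input of shape (1953, 9) and targets of shape (1953, 9, 33).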
My question is: am I doing this conceptually right? Whenever I run training, the accuracy gets stuck at about 56%.
Kindly enlighten me. Thanks.
Character-level embedding uses a one-dimensional convolutional neural network (1D-CNN) to find a numeric representation of words by looking at their character-level composition. You can think of a 1D-CNN as a process in which several scanners slide through a word, character by character.
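As a rough sketch of that idea (the sizes below are arbitrary and not tied to the question's data), a Conv1D layer on top of character embeddings acts like such a scanner:

from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D

char_cnn = Sequential()
char_cnn.add(Embedding(33, 8, input_length=9))    # one 8-dim vector per character
char_cnn.add(Conv1D(filters=16, kernel_size=3))   # "scanner" sliding over 3 characters at a time
char_cnn.add(GlobalMaxPooling1D())                # one fixed-length vector per word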
Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.
With a character embedding, a vector can be formed for every single word, even an out-of-vocabulary one (optional). Word embeddings, on the other hand, can only handle words seen during training.
Keras provides an Embedding layer that converts each word into a fixed-length, real-valued vector. The one-hot-encoding technique generates a large sparse vector to represent a single word, whereas with an embedding layer every word gets a dense real-valued vector of fixed length.
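For illustration, a minimal sketch of the Embedding layer on its own (the dimensions and index values are arbitrary examples):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

emb = Sequential()
emb.add(Embedding(input_dim=33, output_dim=8, input_length=9))  # 33 symbols -> 8-dim dense vectors

batch = np.array([[1, 3, 4, 5, 2, 0, 0, 0, 0],    # two padded words as integer indices
                  [1, 6, 7, 2, 0, 0, 0, 0, 0]])
print(emb.predict(batch).shape)                   # (2, 9, 8): one dense vector per character position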
To my knowledge, the structure is basic and should work to some degree. I have some suggestions.
In the TimeDistributed layer, you should add a softmax activation function, which is widely used in multi-class classification. Right now your structure's output is unbounded, which is not intuitive given that your target is one-hot. With the softmax function, you can change the loss to categorical cross-entropy, which increases the probability of the correct class and decreases the others. That is more appropriate.
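A minimal sketch of that suggestion applied to your architecture (embedding_size is a placeholder for whatever value you already use):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, TimeDistributed

embedding_size = 16   # placeholder; keep your own value

model = Sequential()
model.add(Embedding(33, embedding_size, input_length=9))
model.add(LSTM(128, return_sequences=True))
model.add(TimeDistributed(Dense(33, activation='softmax')))   # softmax gives a distribution over the 33 characters at each step
model.compile(loss='categorical_crossentropy',                # cross-entropy matches the one-hot targets
              optimizer='rmsprop', metrics=['accuracy'])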
You can give it a try. For a more capable model, you could try the following structure, which is given in the PyTorch tutorial. Thanks.