I am trying to build a small LSTM that can learn to write code (even if it's garbage code) by training it on existing Python code. I have concatenated several hundred files, a few thousand lines of code in total, into one training file, with each file ending in <eos> to signify "end of sequence".
As an example, my training file looks like:
setup(name='Keras',
...
],
packages=find_packages())
<eos>
import pyux
...
with open('api.json', 'w') as f:
json.dump(sign, f)
<eos>
I am creating tokens from the words with:
with open(self.textfile, 'r') as file:
    filecontents = file.read()

# collapse blank lines, then turn newlines and 4-space indentation into their own tokens
filecontents = filecontents.replace("\n\n", "\n")
filecontents = filecontents.replace('\n', ' \n ')
filecontents = filecontents.replace('    ', ' \t ')

text_in_words = [w for w in filecontents.split(' ') if w != '']
self._words = set(text_in_words)

# slide a window of seq_length tokens over the text; the token that follows
# each window is the prediction target
STEP = 1
self._codelines = []
self._next_words = []
for i in range(0, len(text_in_words) - self.seq_length, STEP):
    self._codelines.append(text_in_words[i: i + self.seq_length])
    self._next_words.append(text_in_words[i + self.seq_length])
My Keras model is:
model = Sequential()
model.add(Embedding(input_dim=len(self._words), output_dim=1024))
model.add(Bidirectional(
    LSTM(128), input_shape=(self.seq_length, len(self._words))))
model.add(Dropout(rate=0.5))
model.add(Dense(len(self._words)))
model.add(Activation('softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer="adam", metrics=['accuracy'])
But no matter how much I train it, the model never seems to generate <eos> or even \n. I think it might be because my LSTM size is 128 and my seq_length is 200, but that doesn't quite make sense. Is there something I'm missing?
Dense layers can improve overall accuracy, and a handful of units per layer is a reasonable starting point for small problems, but the output shape of the final Dense layer is determined by the number of units you specify, so for next-word prediction it has to equal the vocabulary size. Every LSTM layer should also be accompanied by a Dropout layer.
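A minimal sketch of that arrangement, assuming a word-level model; vocab_size and seq_length here are illustrative placeholders, not values from your question:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

vocab_size = 5000   # assumed vocabulary size
seq_length = 200    # assumed input window length

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=256))
model.add(LSTM(128))                 # recurrent layer
model.add(Dropout(0.5))              # dropout paired with the LSTM layer
model.add(Dense(vocab_size, activation='softmax'))  # output units = vocabulary size
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')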
The vanilla LSTM network has three layers: an input layer, a single hidden LSTM layer, and a standard feedforward output layer.
The Keras LSTM layer implements the long short-term memory unit introduced by Hochreiter and Schmidhuber in 1997. Based on the available runtime hardware and the constraints you place on the layer, it chooses the most optimized implementation available, either pure TensorFlow or cuDNN-based.
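For example (a hedged illustration, not your code): in TensorFlow 2.x, keeping the LSTM layer's default arguments lets Keras use the fused cuDNN kernel when a GPU is available, while changing certain arguments forces the slower generic implementation.

import tensorflow as tf

# Defaults (tanh activation, sigmoid recurrent_activation, recurrent_dropout=0)
# allow the fused cuDNN kernel to be used on a GPU.
fast_lstm = tf.keras.layers.LSTM(128)

# Setting recurrent_dropout > 0 disqualifies the cuDNN kernel, so Keras falls
# back to the generic implementation.
regularized_lstm = tf.keras.layers.LSTM(128, recurrent_dropout=0.2)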
Sometimes, when there is no limit on the length of the generated sequence, or when the <EOS> or <SOS> tokens are not mapped to numerical token ids, the LSTM never converges. If you could post your outputs or error messages, it would be much easier to debug.
You could create an extra class for building a vocabulary from words and sentences.
# tokens for start of sentence (SOS) and end of sentence (EOS)
SOS_token = 0
EOS_token = 1

class Lang:
    '''
    Class for a word/vocabulary object, storing sentences, words and word counts.
    '''
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
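As a hedged usage sketch (the file name training_data.txt is a placeholder for your concatenated corpus), you could build the vocabulary line by line like this:

# Every token in the corpus gets an integer id; ids 0 and 1 are reserved
# for SOS and EOS by the class above.
lang = Lang("python-code")
with open("training_data.txt", "r") as f:
    for line in f:
        lang.addSentence(line.strip())

print(lang.n_words)                    # vocabulary size, including SOS and EOS
print(lang.word2index.get("import"))   # integer id assigned to a token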
Then, while generating text, just adding an <SOS> token at the start would do.
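A rough sketch of such a generation loop (model_predict_next is a hypothetical callable that stands in for whatever sampling you do on the model's softmax output and returns the next token id):

# Hypothetical generation loop: seed with SOS_token, stop at EOS_token
# or after max_len tokens, whichever comes first.
def generate(model_predict_next, lang, max_len=200):
    tokens = [SOS_token]
    while len(tokens) < max_len:
        next_id = model_predict_next(tokens)   # int id of the next token
        if next_id == EOS_token:
            break
        tokens.append(next_id)
    return ' '.join(lang.index2word[t] for t in tokens[1:])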
You can use https://github.com/sherjilozair/char-rnn-tensorflow, a character-level RNN, for reference.