I am creating a neural net on a very large text dataset using Keras. To build the model and make sure everything was working, I read a fraction of the data into memory and used the built-in Keras Tokenizer to do the necessary preprocessing, including mapping each word to a token. Then I used model.fit().
Now I want to extend to the full dataset, and I don't have the space to read all the data into memory. So I'd like to write a generator function that sequentially reads data from disk and use model.fit_generator(). However, if I do this, then I fit a separate Tokenizer object on each batch of data, which gives a different word-to-token mapping for each batch. Is there any way around this? Is there any way I can continuously build a token dictionary with Keras?
You can define a text generator and feed it to the fit_on_texts method in the following manner.

Assuming that you have a texts_generator, which reads your data from disk in parts and returns an iterable collection of texts, you may define:
def text_generator(texts_generator):
    for texts in texts_generator:
        for text in texts:
            yield text
Take care that this generator must stop after it has read the whole of the data from disk; this may require you to change the original generator you want to use in model.fit_generator, which typically loops indefinitely.
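For concreteness, here is a minimal sketch of what such a finite texts_generator could look like, assuming the corpus is stored as plain-text files with one document per line under a hypothetical data/ directory (the directory name, batch size, and file layout are all assumptions):

import os

def read_texts_from_disk(data_dir="data/", batch_size=1000):
    # Yields lists of texts, one batch at a time, and stops after a
    # single pass over every file in data_dir, so fit_on_texts terminates.
    batch = []
    for filename in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, filename), encoding="utf-8") as f:
            for line in f:
                batch.append(line.strip())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch

texts_generator = read_texts_from_disk()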
Once you have the flattened generator above, you can simply apply the tokenizer.fit_on_texts method:

tokenizer.fit_on_texts(text_generator(texts_generator))
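Once the vocabulary is fixed, the same on-disk reading can feed training. Below is a rough sketch of the rest of the pipeline, assuming the read_texts_from_disk helper from above, a hypothetical labels_for function that returns the labels aligned with a batch of texts, and arbitrary maxlen / steps_per_epoch values:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def training_batches(maxlen=100):
    # Loops forever, as model.fit_generator expects, re-reading the files
    # each epoch via a fresh read_texts_from_disk() generator.
    while True:
        for texts in read_texts_from_disk():
            X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=maxlen)
            y = np.array(labels_for(texts))  # labels_for is a hypothetical helper
            yield X, y

model.fit_generator(training_batches(), steps_per_epoch=500, epochs=5)

Fitting the tokenizer once up front means every batch is encoded with the same word-to-token mapping, which is exactly what the question asks for.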