
Keras: Tokenizer with fit_generator() on text data

Tags:

python

keras

I am creating a neural net on a very large text dataset using Keras. To build the model and make sure everything was working, I read a fraction of the data into memory, and used the built-in Keras Tokenizer to do the necessary preprocessing, including mapping each word to a token. Then I used model.fit().

Now I want to extend to the full dataset, and don't have the space to read all the data into memory. So I'd like to write a generator function that sequentially reads data from disk, and use model.fit_generator(). However, if I do this, then I separately fit a Tokenizer object on each batch of data, which produces a different word-to-token mapping for each batch. Is there any way around this? Is there any way I can continuously build a token dictionary with Keras?
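For reference, here is a minimal sketch of the in-memory workflow described above (load_sample, model, and labels are hypothetical placeholders standing in for my actual code):

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    texts = load_sample()          # hypothetical: list of raw strings small enough for memory
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)  # builds the word-to-token mapping

    x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=100)
    model.fit(x, labels)           # model and labels defined elsewhere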

Ben F asked Mar 03 '17


1 Answer

So basically you could define a text generator and feed it to the fit_on_texts method in the following manner:

  1. Assuming that you have a texts_generator that reads your data from disk in parts and returns an iterable collection of texts, you may define:

    def text_generator(texts_generator):
        # Flatten each batch of texts into a stream of individual texts.
        for texts in texts_generator:
            for text in texts:
                yield text
    

    Please take care that this generator stops after it has read all of the data from disk; this may require you to change the original generator you want to use in model.fit_generator. (See the sketch after this list for one way to write such a finite generator.)

  2. Once you have the generator from step 1, you may simply apply the tokenizer.fit_on_texts method:

    tokenizer.fit_on_texts(text_generator(texts_generator))
    
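Putting both steps together, here is a minimal end-to-end sketch. It assumes a hypothetical corpus file corpus.txt with one document per line and a hypothetical get_labels helper; fit_on_texts iterates over whatever iterable of strings it is given, so the flattened generator from step 1 can be passed to it directly:

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    CORPUS_PATH = 'corpus.txt'  # hypothetical: one document per line
    BATCH_SIZE = 32
    MAXLEN = 100                # hypothetical padded sequence length

    def texts_generator():
        """Finite generator: yields batches (lists) of raw texts, then stops."""
        batch = []
        with open(CORPUS_PATH) as f:
            for line in f:
                batch.append(line.strip())
                if len(batch) == BATCH_SIZE:
                    yield batch
                    batch = []
        if batch:
            yield batch

    def text_generator(texts_generator):
        # Step 1: flatten batches into a stream of individual texts.
        for texts in texts_generator:
            for text in texts:
                yield text

    # Pass 1 over the data: build one global word index for the whole corpus.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text_generator(texts_generator()))

    # Pass 2: an endless generator of (x, y) batches for model.fit_generator,
    # reusing the single fitted tokenizer. get_labels is a hypothetical
    # function that returns the label array for a batch of texts.
    def training_generator(tokenizer):
        while True:  # fit_generator expects its generator to loop forever
            for texts in texts_generator():
                sequences = tokenizer.texts_to_sequences(texts)
                x = pad_sequences(sequences, maxlen=MAXLEN)
                yield x, get_labels(texts)

    # model.fit_generator(training_generator(tokenizer),
    #                     steps_per_epoch=steps, epochs=10)

The point of this design is that there are two separate kinds of passes over the disk data: one finite pass to fit the tokenizer, then repeated passes that only transform texts with the already-fixed vocabulary, so every batch uses the same word-to-token mapping.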
Marcin Możejko answered Oct 14 '22