
Keras: Tokenizer with fit_generator() on text data

Tags:

python

keras

I am creating a neural net on a very large text dataset using Keras. To build the model and make sure everything was working, I read a fraction of the data into memory, and used the built-in Keras Tokenizer to do the necessary preprocessing, including mapping each word to a token. Then I used model.fit().

Now I want to extend to the full dataset, and don't have the space to read all the data into memory. So I'd like to write a generator function that sequentially reads data from disk, and use model.fit_generator(). However, if I do this, then I separately fit a Tokenizer object on each batch of data, which produces a different word-to-token mapping for each batch. Is there any way around this? Is there any way I can continuously build a token dictionary with Keras?
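For reference, here is a minimal sketch of the in-memory workflow described above (load_sample, model, and labels are hypothetical placeholders standing in for my actual code):

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    texts = load_sample()          # hypothetical: list of raw strings small enough for memory
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)  # builds the word-to-token mapping

    x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=100)
    model.fit(x, labels)           # model and labels defined elsewhere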

Ben F asked Mar 03 '17


1 Answer

So basically you could define a text generator and feed it to the fit_on_texts method in the following manner:

  1. Assuming that you have a texts_generator that reads your data from disk in parts and returns an iterable collection of texts, you may define:

    def text_generator(texts_generator):
        # Flatten each batch of texts into a stream of individual texts.
        for texts in texts_generator:
            for text in texts:
                yield text
    

    Please take care that this generator stops after it has read all of the data from disk; this may require you to change the original generator you want to use in model.fit_generator. (See the sketch after this list for one way to write such a finite generator.)

  2. Once you have the generator from step 1, you may simply apply the tokenizer.fit_on_texts method:

    tokenizer.fit_on_texts(text_generator(texts_generator))
    
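Putting both steps together, here is a minimal end-to-end sketch. It assumes a hypothetical corpus file corpus.txt with one document per line and a hypothetical get_labels helper; fit_on_texts iterates over whatever iterable of strings it is given, so the flattened generator from step 1 can be passed to it directly:

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    CORPUS_PATH = 'corpus.txt'  # hypothetical: one document per line
    BATCH_SIZE = 32
    MAXLEN = 100                # hypothetical padded sequence length

    def texts_generator():
        """Finite generator: yields batches (lists) of raw texts, then stops."""
        batch = []
        with open(CORPUS_PATH) as f:
            for line in f:
                batch.append(line.strip())
                if len(batch) == BATCH_SIZE:
                    yield batch
                    batch = []
        if batch:
            yield batch

    def text_generator(texts_generator):
        # Step 1: flatten batches into a stream of individual texts.
        for texts in texts_generator:
            for text in texts:
                yield text

    # Pass 1 over the data: build one global word index for the whole corpus.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text_generator(texts_generator()))

    # Pass 2: an endless generator of (x, y) batches for model.fit_generator,
    # reusing the single fitted tokenizer. get_labels is a hypothetical
    # function that returns the label array for a batch of texts.
    def training_generator(tokenizer):
        while True:  # fit_generator expects its generator to loop forever
            for texts in texts_generator():
                sequences = tokenizer.texts_to_sequences(texts)
                x = pad_sequences(sequences, maxlen=MAXLEN)
                yield x, get_labels(texts)

    # model.fit_generator(training_generator(tokenizer),
    #                     steps_per_epoch=steps, epochs=10)

The point of this design is that there are two separate kinds of passes over the disk data: one finite pass to fit the tokenizer, then repeated passes that only transform texts with the already-fixed vocabulary, so every batch uses the same word-to-token mapping.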
Marcin Możejko answered Oct 14 '22