 

Reloading Keras Tokenizer during Testing

I followed the tutorial here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

However, I modified the code to be able to save the generated model through h5py. Thus, after running the training script, I have a generated model.h5 in my directory.

Now, when I want to load it, my problem is that I'm confused as to how to re-initiate the Tokenizer. The tutorial has the following line of code:

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

(Note: in Keras 2 and later, the nb_words argument has been renamed to num_words.)

But hypothetically, if I reload model.h5 in a different module, I'll need to create another Tokenizer to tokenize the test set. But then the new Tokenizer will be fit on the test data, producing a completely different word table.

Therefore, my question is: How do I reload the Tokenizer that was trained on the training dataset? Am I in some way misunderstanding the functionality of the Embedding layer in Keras? Right now, I'm assuming that since we mapped certain word indices to their corresponding embedding vectors based on the pre-trained word embeddings, the word indices need to be consistent. However, this is not possible if we perform another fit_on_texts on the test dataset.
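The index-consistency concern above can be made concrete with a simplified stand-in for fit_on_texts (this is not the actual Keras implementation, just a sketch of its frequency-ranked index assignment): fitting on two different corpora assigns the same word different indices, so embedding lookups would hit the wrong rows.

```python
from collections import Counter

def fit_word_index(texts):
    # Simplified sketch of what Tokenizer.fit_on_texts does:
    # rank words by frequency and assign indices starting at 1.
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

train_index = fit_word_index(["the cat sat", "the dog sat"])
test_index = fit_word_index(["a bird flew", "the bird sang"])

# The same word gets a different index in each fit, so an Embedding
# layer trained against train_index would be indexed incorrectly
# by sequences produced from test_index.
print(train_index["the"], test_index["the"])
```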

Thank you and looking forward to your answers!

Vandenn asked Jun 26 '17

1 Answer

Check out this question. The commenter recommends pickling the Tokenizer to save its object and state, though the question still remains why this kind of functionality is not built into Keras.

Miriam Herm answered Sep 23 '22