I've trained a sentiment classifier model using Keras library by following the below steps(broadly).
Now for scoring using this model, I was able to save the model to a file and load from a file. However I've not found a way to save the Tokenizer object to file. Without this I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?
The fit_on_texts method is a part of Keras tokenizer class which is used to update the internal vocabulary for the texts list. We need to call be before using other methods of texts_to_sequences or texts_to_matrix.
word_index is used to find the length of the vector made. Use this the determine the extent of you vocabulary. – Shekhar Banerjee. Aug 5, 2021 at 16:54. in simpler words, its the number of unique tokens.
This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf...
The most common way is to use either pickle
or joblib
. Here you have an example on how to use pickle
in order to save Tokenizer
:
import pickle # saving with open('tokenizer.pickle', 'wb') as handle: pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL) # loading with open('tokenizer.pickle', 'rb') as handle: tokenizer = pickle.load(handle)
Tokenizer class has a function to save date into JSON format:
tokenizer_json = tokenizer.to_json() with io.open('tokenizer.json', 'w', encoding='utf-8') as f: f.write(json.dumps(tokenizer_json, ensure_ascii=False))
The data can be loaded using tokenizer_from_json
function from keras_preprocessing.text
:
with open('tokenizer.json') as f: data = json.load(f) tokenizer = tokenizer_from_json(data)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With