 

Keras Text Preprocessing - Saving Tokenizer object to file for scoring

I've trained a sentiment classifier model using the Keras library by following the steps below (broadly):

  1. Convert Text corpus into sequences using Tokenizer object/class
  2. Build a model using the model.fit() method
  3. Evaluate this model

Now for scoring with this model, I was able to save the model to a file and load it back. However, I haven't found a way to save the Tokenizer object to a file. Without it, I'd have to process the whole corpus every time I need to score even a single sentence. Is there a way around this?
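For context, the training-time preprocessing described in the steps above might look like this minimal sketch (the corpus, num_words, and maxlen values are placeholders, not the asker's actual data):

```python
# Minimal sketch of the preprocessing step described above.
# corpus, num_words and maxlen are illustrative placeholders.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = ["this movie was great", "terrible plot and acting"]

tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(corpus)                      # build the vocabulary from the corpus

sequences = tokenizer.texts_to_sequences(corpus)    # text -> lists of integer indices
padded = pad_sequences(sequences, maxlen=10)        # fixed-length input for model.fit()
```

The tokenizer built here is exactly the object that needs to survive to scoring time, which is what the answers below address.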

Rajkumar Kaliyaperumal asked Aug 17 '17 12:08

People also ask

What is Tokenizer Fit_on_texts?

The fit_on_texts method is part of the Keras Tokenizer class and updates the internal vocabulary from a list of texts. It needs to be called before other methods such as texts_to_sequences or texts_to_matrix.
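A small illustration of fit_on_texts (the example texts are made up):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["hello world", "hello keras"])

# fit_on_texts has populated the internal vocabulary; indices start at 1
# and are ordered by frequency, so "hello" (seen twice) gets index 1.
# Only after this call do texts_to_sequences / texts_to_matrix make sense.
sequences = tokenizer.texts_to_sequences(["hello world"])
```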

What is Tokenizer Word_index?

word_index is a dictionary mapping each unique token to its integer index. Its length tells you the extent of your vocabulary; in simpler words, it's the number of unique tokens.
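For example (toy texts, not from the question):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["the cat sat", "the dog sat"])

# word_index maps token -> index; its length is the vocabulary size.
vocab_size = len(tokenizer.word_index)   # 4 unique tokens: the, cat, sat, dog
```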

What does Tokenizer do in Tensorflow?

This class allows you to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.
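The two output styles mentioned above can be sketched like this (toy texts; the "binary" and "count" modes are two of the vectorization options the Tokenizer supports):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["good good movie", "bad movie"])

# Style 1: each text becomes a sequence of integer token indices.
seqs = tokenizer.texts_to_sequences(["good movie"])

# Style 2: each text becomes a fixed-size vector, one slot per vocabulary index.
binary = tokenizer.texts_to_matrix(["good good movie"], mode="binary")
counts = tokenizer.texts_to_matrix(["good good movie"], mode="count")
```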


2 Answers

The most common way is to use either pickle or joblib. Here is an example of using pickle to save a Tokenizer:

import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
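This directly addresses the question's scoring problem: once reloaded, the tokenizer maps new sentences with the original vocabulary, so the corpus never needs to be reprocessed. A self-contained round-trip sketch (the two-line corpus is a stand-in for the real one):

```python
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit once on a stand-in corpus, save, then reload, mimicking scoring time.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(["great film", "awful film"])

with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('tokenizer.pickle', 'rb') as handle:
    loaded = pickle.load(handle)

# The reloaded tokenizer scores a single new sentence identically.
scored = loaded.texts_to_sequences(["great film"])
```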
Marcin Możejko answered Sep 18 '22 14:09


The Tokenizer class has a function to save its data in JSON format:

import io
import json

tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))

The data can be loaded back using the tokenizer_from_json function from keras_preprocessing.text:

import json
from keras_preprocessing.text import tokenizer_from_json

with open('tokenizer.json') as f:
    data = json.load(f)
    tokenizer = tokenizer_from_json(data)
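A quick round-trip check of this JSON approach (toy text; note that tf.keras also exposes an equivalent tokenizer_from_json under tensorflow.keras.preprocessing.text, which is what this sketch assumes):

```python
from tensorflow.keras.preprocessing.text import Tokenizer, tokenizer_from_json

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["save me to json"])

# to_json() returns a JSON string; tokenizer_from_json() rebuilds the object.
tokenizer_json = tokenizer.to_json()
restored = tokenizer_from_json(tokenizer_json)
```

An advantage of JSON over pickle is that the file is human-readable and not tied to Python's pickle protocol.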
Max answered Sep 18 '22 14:09