 

Using Keras' tokenizer with premade indexed dictionary

I am working on an NLP problem.

I have downloaded premade embedding weights to use for an embedding layer. Before the embedding layer, I need to tokenize my dataset, which is currently in the form of strings of sentences. I want to tokenize it using the same word indices that my premade embedding weights use.

Is there a way to initialize the Keras tokenizer (tensorflow.keras.preprocessing.text.Tokenizer) with a premade dictionary of the sort: { 'the': 1, 'me': 2, 'a': 3 ..... } so it won't decide on its own which index to give each word?
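For context, an index of this sort would typically be built from the pretrained-vectors file itself, roughly like this (a minimal sketch; the file name "embeddings.txt" and the one-word-plus-vector-per-line format are assumptions about my data, not part of Keras):

word_index = {}
embedding_vectors = {}
with open("embeddings.txt", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        parts = line.rstrip().split(" ")
        word, vector = parts[0], [float(x) for x in parts[1:]]
        word_index[word] = i          # e.g. {'the': 1, 'me': 2, 'a': 3, ...}
        embedding_vectors[word] = vector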

asked Mar 25 '18 by Fuseques


1 Answer

You can initialize a Tokenizer object and manually assign the word index to it. You can then use it to index your sentences.

from tensorflow.keras.preprocessing import text

token = text.Tokenizer()
token.word_index = {"the": 1, "elephant": 2}  # assign your premade index
token.texts_to_sequences(["the elephant"])

This will return [[1, 2]]
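If the goal is to pair this tokenizer with premade embedding weights, you can then build the embedding matrix in the same index order and pass it to an Embedding layer. This is only a minimal sketch: the tiny 3-dimensional vectors stand in for your real pretrained weights.

import numpy as np
from tensorflow.keras.layers import Embedding

word_index = {"the": 1, "elephant": 2}
embedding_vectors = {"the": [0.1, 0.2, 0.3], "elephant": [0.4, 0.5, 0.6]}
embedding_dim = 3

# Row 0 is left as zeros for padding; row i holds the vector for the word
# whose index is i in word_index.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, idx in word_index.items():
    embedding_matrix[idx] = embedding_vectors[word]

embedding_layer = Embedding(
    input_dim=len(word_index) + 1,
    output_dim=embedding_dim,
    weights=[embedding_matrix],
    trainable=False,
)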

answered Sep 22 '22 by Hamiz Ahmed