I am working on an NLP problem.
I have downloaded pretrained embedding weights to use in an embedding layer. Before the embedding layer I need to tokenize my dataset, which is currently strings of sentences, and I want the tokenizer to use the same word-to-index mapping as my pretrained embeddings.
Is there a way to initialize the Keras tokenizer (tensorflow.keras.preprocessing.text.Tokenizer) with a premade dictionary like { 'the': 1, 'me': 2, 'a': 3, ... } so that it won't decide on its own which index to give each word?
You can initialize a Tokenizer object and assign your premade dictionary to its word_index attribute manually. texts_to_sequences will then use those indices when encoding your sentences.
from tensorflow.keras.preprocessing import text

# Create a tokenizer and override its word index with the premade mapping.
token = text.Tokenizer()
token.word_index = {"the": 1, "elephant": 2}
token.texts_to_sequences(["the elephant"])
This returns [[1, 2]].
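If your pretrained weights came with a vocabulary file, you can build the mapping from it instead of writing it by hand. Here is a minimal sketch, assuming a GloVe-style text file (hypothetical path embeddings.txt, one word per line followed by its vector values); indices start at 1 because Keras conventionally reserves 0 for padding:

from tensorflow.keras.preprocessing import text

# Build a word -> index mapping from a GloVe-style embeddings file.
# "embeddings.txt" is a hypothetical path; each line is "word v1 v2 ...".
word_index = {}
with open("embeddings.txt", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        word = line.split(" ", 1)[0]  # first token on the line is the word
        word_index[word] = i

token = text.Tokenizer()
token.word_index = word_index
sequences = token.texts_to_sequences(["the elephant"])

One caveat: by default the Tokenizer lowercases text and strips punctuation before looking words up, so make sure your premade dictionary uses the same normalization. Words missing from word_index are silently dropped unless you configure an OOV token that is itself present in the mapping.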