If I don't pass the num_words
argument when initializing Tokenizer()
, how do I find the vocabulary size after it has been fit on the training dataset?
The reason I want to do it this way: I don't want to limit the tokenizer's vocabulary size, so I can see how well my Keras model performs without that limit. But I still need to pass this vocabulary size as an argument in the model's first layer definition.
All the words and their indices are stored in a dictionary that you can access via tokenizer.word_index
. Therefore, the number of unique words is simply the number of entries in this dictionary:
num_words = len(tokenizer.word_index) + 1
The + 1
is there because index zero is reserved for padding.
Note: This solution (obviously) applies when you have not set the num_words
argument (i.e. you don't know the number of words, or don't want to limit it). Either way, word_index
contains all the words (not only the most frequent ones), whether or not you set num_words
.
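Putting this together, here is a minimal end-to-end sketch (the toy texts are assumptions for illustration; the import path assumes TensorFlow's bundled Keras):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy training texts -- a hypothetical stand-in for your real dataset
texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
]

# No num_words argument: the vocabulary is not limited
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_index maps every unique word to an integer index, starting at 1
# (words are indexed by descending frequency, so "the" gets index 1 here)
print(tokenizer.word_index)

# + 1 reserves index zero for padding
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
```

vocab_size is then what you would pass as, e.g., the input_dim of an Embedding layer.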