If I don't pass the num_words
argument when initializing Tokenizer()
, how do I find the vocabulary size after it has been fit on the training dataset?
The reason I want to do it this way: I don't want to limit the tokenizer's vocabulary size, so I can see how well my Keras model performs without that limit. But I still need to pass this vocabulary size as an argument in the model's first layer definition.
All the words and their indices are stored in a dictionary that you can access via tokenizer.word_index
. Therefore, the number of unique words is simply the number of entries in this dictionary:
num_words = len(tokenizer.word_index) + 1
The + 1
is there because index zero is reserved for padding.
Note: This solution (obviously) applies when you have not set the num_words
argument (i.e. you don't know the number of words, or don't want to limit it). Either way, word_index
contains all the words (not only the most frequent ones), whether or not you set num_words
.
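Putting this together, here is a minimal end-to-end sketch (the toy texts are assumptions for illustration; the import path assumes TensorFlow's bundled Keras):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Toy training texts -- a hypothetical stand-in for your real dataset
texts = [
    "the cat sat on the mat",
    "the dog ate my homework",
]

# No num_words argument: the vocabulary is not limited
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_index maps every unique word to an integer index, starting at 1
# (words are indexed by descending frequency, so "the" gets index 1 here)
print(tokenizer.word_index)

# + 1 reserves index zero for padding
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
```

vocab_size is then what you would pass as, e.g., the input_dim of an Embedding layer.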