
How to find "num_words" or vocabulary size of Keras tokenizer when one is not assigned?

If I do not pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it has been used to tokenize the training dataset?

The reason: I don't want to limit the tokenizer's vocabulary size, so that I can see how well my Keras model performs without a limit. But I then need to pass this vocabulary size as an argument in the definition of the model's first layer.

karthiks asked Nov 28 '18 18:11


1 Answer

All the words and their indices are stored in a dictionary, which you can access using tokenizer.word_index. Therefore, you can find the number of unique words from the number of elements in this dictionary:

num_words = len(tokenizer.word_index) + 1

The + 1 accounts for index zero, which is reserved for padding: word indices start at 1, so the largest index equals len(tokenizer.word_index).

Note: This solution applies when you have not set the num_words argument (i.e. you don't know or don't want to limit the number of words), since word_index contains all the words (not only the most frequent ones) whether or not you set num_words.
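To make the + 1 concrete, here is a minimal sketch that does not use Keras. The build_word_index helper is hypothetical: a simplified stand-in for what the tokenizer does when fit on texts (it only lowercases and splits on whitespace), assuming the same convention that the most frequent word gets index 1 and index 0 is never assigned to a word.

```python
from collections import Counter

def build_word_index(texts):
    """Hypothetical stand-in for the tokenizer's word_index:
    words ranked by frequency, indices starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # enumerate(..., start=1) leaves index 0 unused, as Keras does for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["the cat sat", "the dog sat down"]
word_index = build_word_index(texts)

# Same formula as in the answer: unique words plus the reserved index 0.
vocab_size = len(word_index) + 1
largest_index = max(word_index.values())

print(vocab_size, largest_index)
```

The largest index is one less than vocab_size, so a layer sized with vocab_size can look up every index from 0 (padding) up to the last word. If the model's first layer is an Embedding layer, this is the value you would pass as its input_dim.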

today answered Oct 03 '22 04:10