
Keras Tokenizer num_words doesn't seem to work

>>> from keras.preprocessing.text import Tokenizer
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_index
{'fantastic': 6, 'like': 10, 'no': 8, 'this': 2, 'is': 3, 'there': 7, 'one': 11, 'other': 9, 'so': 5, 'world': 1, 'hello': 4}

I'd have expected t.word_index to have just the top 3 words. What am I doing wrong?

max_max_mir asked Sep 13 '17 16:09


People also ask

What is num_words in the Keras Tokenizer?

num_words is the maximum vocabulary size that the Tokenizer uses when it transforms texts. Choose it carefully, because it affects the performance of the model. By default, num_words is None, which keeps every word. To cover the whole fitted vocabulary explicitly, use len(tokenizer.word_index) + 1, since index 0 is reserved and word indices start at 1.

Does the Keras Tokenizer remove punctuation?

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character).
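The filtering described above can be sketched in plain Python. This is only an approximation written for illustration, not the Keras source: `split_words` is a hypothetical helper, and the `FILTERS` string is the default value of the Tokenizer's `filters` argument as given in the Keras documentation.

```python
# Default filter characters, as listed for the `filters` argument in the
# Keras Tokenizer documentation. Note that the apostrophe is not in the set.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=FILTERS, lower=True):
    """Hypothetical helper: replace filtered characters with spaces, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to a space so that words stay separated.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


words = split_words("Hello, World! This is so&#$ fantastic!")
print(words)  # ['hello', 'world', 'this', 'is', 'so', 'fantastic']

with_apostrophe = split_words("Don't panic")
print(with_apostrophe)  # ["don't", 'panic']
```

This also explains why "so&#$" in the question ends up as the plain word "so" in word_index.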

What exactly does the Keras Tokenizer do?

The Tokenizer class of Keras is used for vectorizing a text corpus. Each text input is converted either into a sequence of integers or into a vector that has a coefficient for each token, for example a binary value.


1 Answer

There is nothing wrong with what you are doing. word_index is computed the same way no matter how many of the most frequent words you use later (as you may see here). When you call any transformative method, such as texts_to_sequences, the Tokenizer uses only the most common words allowed by num_words (the Keras documentation states that only the most common num_words - 1 words are kept, so num_words=3 keeps the top two). At the same time, it keeps the counts and indices of all words, even those it will obviously never use.
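To make the distinction concrete, here is a minimal pure-Python sketch of the behaviour described in this answer. It is a simplified model, not the Keras implementation: `tokenize`, `fit_word_index` and `to_sequences` are hypothetical helpers, the regex tokenizer only approximates the default Keras filters, and the `index < num_words` cutoff is an assumption based on the documented "num_words - 1" rule.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified stand-in for the Keras filters: lowercase the text and keep
    # runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_word_index(texts):
    # Count every word, then rank by frequency. The sort is stable, so ties
    # keep their order of first appearance. num_words plays no role here.
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words=None):
    # The cutoff is applied only when texts are transformed
    # (assumption: keep indices strictly below num_words).
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in tokenize(text) if w in word_index]
        if num_words:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


texts = ["Hello, World! This is so&#$ fantastic!",
         "There is no other world like this one"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)

print(len(word_index))  # 11, the full vocabulary is still indexed
print(sequences)        # [[1, 2], [1, 2]], only 'world' and 'this' survive
```

With the real Tokenizer from the question, the corresponding transformative call is t.texts_to_sequences(l).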

Marcin Możejko answered Oct 05 '22 19:10