I have a dataframe where the column Title of the first row contains this text:
Use of hydrocolloids as cryoprotectant for frozen foods
Using this code:
from tensorflow.keras.preprocessing.text import Tokenizer

vocabulary_size = 1000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['Title'])
sequences = tokenizer.texts_to_sequences(df['Title'])
print(sequences[0])
I am getting this sequence:
[57, 1, 21, 7]
Using this:
index_word = {v: k for k, v in tokenizer.word_index.items()}
print(index_word[57])
print(index_word[1])
print(index_word[21])
print(index_word[7])
I obtain:
use
of
as
for
It makes sense, as these are the most frequent words. Is it also possible to use the Tokenizer to base the tokenisation on tf-idf?
Increasing the vocabulary_size also tokenises less frequent words like:
hydrocolloids
I intend to use GloVe downstream for a classification task. Does it make sense to keep frequent, and thus potentially less discriminative, words like:
use
in the vocabulary? Maybe yes, as GloVe also looks at context, in contrast to the bag-of-words approaches I used in the past, where tf-idf makes sense.
As of now (Keras keeps updating its functions), there is nothing built in that can produce exactly what you want.
But the Tokenizer does have a method that represents the texts using the tf-idf scheme instead of raw frequency counts:
sequences = tokenizer.texts_to_matrix(df['Title'], mode='tfidf')
instead of:
sequences = tokenizer.texts_to_sequences(df['Title'])
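For illustration, here is a minimal sketch of what texts_to_matrix returns. The first title is from the question; the second is a made-up example row:

import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

df = pd.DataFrame({'Title': [
    'Use of hydrocolloids as cryoprotectant for frozen foods',
    'Use of starch as thickener for frozen desserts',  # hypothetical second row
]})

vocabulary_size = 1000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['Title'])

# One row per document and num_words columns; entry j holds the tf-idf
# weight of word j in that document (0 where the word does not occur).
matrix = tokenizer.texts_to_matrix(df['Title'], mode='tfidf')
print(matrix.shape)  # (2, 1000)

Note that, unlike texts_to_sequences, this representation discards word order, so it suits a bag-of-words-style classifier rather than an Embedding/GloVe layer.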
Also, as a suggestion, you can use sklearn's TfidfVectorizer to filter out low-frequency words from the text, then pass the result to your Keras model.
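A minimal sketch of that suggestion (the min_df threshold and the toy model are my own choices, not part of the answer):

from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras import layers, models

titles = [
    'Use of hydrocolloids as cryoprotectant for frozen foods',
    'Use of starch as thickener for frozen desserts',  # hypothetical
]

# min_df=2 drops words appearing in fewer than 2 documents; raise it on a
# real corpus to prune rare words more aggressively.
vectorizer = TfidfVectorizer(min_df=2)
X = vectorizer.fit_transform(titles).toarray()  # dense (n_docs, n_kept_words)

# A toy Keras model that consumes the tf-idf matrix directly.
model = models.Sequential([
    layers.Input(shape=(X.shape[1],)),
    layers.Dense(16, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
# model.fit(X, labels, ...)  # labels omitted here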
The num_words argument to Tokenizer() can help you achieve this.
Here is the description from the documentation: "the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept."
The smaller the num_words you provide, the more rare words will be excluded. If you don't specify that argument, all words will be kept, even the rarest ones.
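A small sketch of that effect (the toy sentences are mine). One caveat worth knowing: word_index always contains every word seen during fitting; the num_words cap is only applied when converting texts:

from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    'frozen foods stay frozen',      # toy examples
    'frozen desserts melt quickly',
]

# Keep only the top (num_words - 1) = 2 most frequent words.
tokenizer = Tokenizer(num_words=3)
tokenizer.fit_on_texts(texts)

print(tokenizer.word_index)                 # all words are indexed regardless
print(tokenizer.texts_to_sequences(texts))  # rarer words are silently dropped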
When you are building your tokenizer, what you are really looking at is document frequency: the number of documents a word appears in. tf-idf is not applicable at this stage, because term frequency refers to how many times a word appears in one particular document.
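If you want to inspect document frequency yourself, a fitted Tokenizer exposes it directly; a minimal sketch (the smoothed idf formula below is the standard textbook one, not necessarily what Keras uses internally):

import math
from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    'frozen foods stay frozen',      # toy examples
    'frozen desserts melt quickly',
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_docs maps word -> number of documents containing it;
# document_count is the total number of documents seen.
n_docs = tokenizer.document_count
for word, df_count in tokenizer.word_docs.items():
    idf = math.log((1 + n_docs) / (1 + df_count)) + 1  # smoothed idf
    print(word, df_count, round(idf, 3))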