
Use tf-idf in Keras Tokenizer

I have a dataframe where the column Title of the first row contains this text:

Use of hydrocolloids as cryoprotectant for frozen foods

Using this code:

from keras.preprocessing.text import Tokenizer

vocabulary_size = 1000
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(df['Title'])
sequences = tokenizer.texts_to_sequences(df['Title'])
print(sequences[0])

I am getting this sequence:

[57, 1, 21, 7]

Using this:

index_word = {v: k for k, v in tokenizer.word_index.items()}
print(index_word[57])
print(index_word[1])
print(index_word[21])
print(index_word[7])

I obtain:

use
of
as
for

It makes sense, as these are the most frequent words. Is it also possible to use the Tokenizer to base the tokenisation on tf-idf?

Increasing the vocabulary_size also tokenises less frequent words like:

hydrocolloids

I intend to use GloVe downstream for a classification task. Does it make sense to keep frequent and thus potentially less discriminative words like:

use

in? Maybe yes, as GloVe also looks at context, in contrast to the bag-of-words approaches I used in the past, where tf-idf makes sense.

asked Sep 07 '18 by cs0815

2 Answers

As of now (Keras is always updating its functions), there is nothing built in that produces exactly what you want.

But it does have a function that represents the texts using a tf-idf scheme instead of raw frequencies:

sequences = tokenizer.texts_to_matrix(df['Title'], mode='tfidf')

instead of:

sequences = tokenizer.texts_to_sequences(df['Title'])
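
For example, here is a minimal sketch of what that call returns (the second title is invented for illustration; you would pass df['Title'] as in the question):

from keras.preprocessing.text import Tokenizer

texts = ['Use of hydrocolloids as cryoprotectant for frozen foods',
         'Use of starch as thickener for frozen foods']  # second title is hypothetical

tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(texts)

# One row per document, one column per word slot; entries are tf-idf weights.
matrix = tokenizer.texts_to_matrix(texts, mode='tfidf')
print(matrix.shape)  # (2, 1000)

Note that this gives a fixed-size document-level matrix rather than one sequence per document, so it does not feed into an embedding layer the way the output of texts_to_sequences does.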

Also, as a suggestion, you can use scikit-learn's TfidfVectorizer to filter low-frequency words out of the text, then pass the result to your Keras model.
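
A minimal sketch of that suggestion (the min_df and max_features values are illustrative, not recommendations):

from sklearn.feature_extraction.text import TfidfVectorizer

# Drop any word that appears in fewer than 2 titles; cap the vocabulary at 1000.
vectorizer = TfidfVectorizer(min_df=2, max_features=1000)
X = vectorizer.fit_transform(df['Title'])  # sparse document-term matrix of tf-idf weights

# Densify before feeding the matrix to a Keras model with a Dense input layer.
X_dense = X.toarray()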

answered Oct 13 '22 by Minions

The num_words argument to Tokenizer() can help you achieve this.

Here is the description from the documentation: "the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept."

The smaller the num_words you provide, the more rare words it will exclude. If you don't specify that argument, all words will be included, even the rarest ones.
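
For instance, a small sketch using the single title from the question (fit_on_texts still sees every word; the cut-off is only applied when converting texts):

from keras.preprocessing.text import Tokenizer

texts = ['use of hydrocolloids as cryoprotectant for frozen foods']

tokenizer = Tokenizer(num_words=5)  # keep only the num_words-1 = 4 most frequent words
tokenizer.fit_on_texts(texts)

# Words whose index is 5 or higher are silently dropped from the output,
# so only the first four words survive here, e.g. [[1, 2, 3, 4]].
print(tokenizer.texts_to_sequences(texts))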

When you are building your tokenizer, what you are really looking for is to account for the document frequency, which is the number of documents the word appears in. tf-idf is not applicable yet, because the term frequency refers to how many times a word appears in a particular document.
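
If you want to inspect document frequency directly, the fitted Tokenizer already tracks it in its word_docs attribute. A quick sketch (the second title is invented for illustration):

from keras.preprocessing.text import Tokenizer

texts = ['use of hydrocolloids as cryoprotectant for frozen foods',
         'use of starch for frozen desserts']  # second title is hypothetical

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)

# word_docs maps each word to the number of documents it appears in.
print(tokenizer.word_docs['use'])     # 2: appears in both titles
print(tokenizer.word_docs['starch'])  # 1: appears in only one title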

answered Oct 13 '22 by JoAnn Alvarez