Is it better to call Keras fit_on_texts on the entire x_data or just the train_data?

I have a dataframe with text columns. I separated them into x_train and x_test.

My question is whether it's better to call Keras's Tokenizer.fit_on_texts() on the entire x data set or just on x_train.

Like this:

tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_data)

or

tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train)  # fit on the training split only
x_train_seq = tokenizer.texts_to_sequences(x_train)

Does it matter? I'll also have to tokenize x_test later, so can I just use the same tokenizer?

Does the Keras Tokenizer remove punctuation?

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens.
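As a rough illustration of that default filtering, here is a minimal pure-Python sketch. It does not call Keras: simple_word_sequence and FILTERS are hypothetical names, and the filter set (ASCII punctuation except the apostrophe, plus tab and newline) is an assumption meant to approximate the behaviour described above.

```python
import string

# Assumed filter set: ASCII punctuation except the apostrophe, plus tab/newline.
FILTERS = string.punctuation.replace("'", "") + "\t\n"


def simple_word_sequence(text, filters=FILTERS, lower=True):
    """Approximate the default cleanup: lowercase, strip punctuation, split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with a space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()


words = simple_word_sequence("Hey, you there! Don't go.")
print(words)
```

Note that the apostrophe in "don't" survives, while the comma, exclamation mark and full stop are dropped.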

How do you use the Tokenizer in Keras?

The Keras Tokenizer class is used for vectorizing a text corpus. Each input text is converted either into a sequence of integers or into a vector that has one coefficient per token, for example a binary value.
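The two output forms mentioned above can be illustrated with a small pure-Python sketch. The helper names and the tiny word_index are made up for the example, and slot 0 is left unused on the assumption that word indices start at 1; this only mimics the idea and is not the Keras implementation.

```python
def to_sequence(text, word_index):
    """Turn a text into a list of word indices, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def to_binary_vector(text, word_index):
    """Turn a text into a 0/1 vector with one slot per known word."""
    vector = [0] * (len(word_index) + 1)  # slot 0 stays unused
    for index in to_sequence(text, word_index):
        vector[index] = 1
    return vector


# Hypothetical dictionary, as it might look after fitting on some texts.
word_index = {'you': 1, 'there': 2, 'hey': 3}

sequence = to_sequence('hey you', word_index)
binary = to_binary_vector('hey you', word_index)
print(sequence)
print(binary)
```

The sequence keeps word order and varies in length, while the binary vector has a fixed length and only records which words are present.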

The information in this question is good, but there are more important things you need to notice:

You MUST use the same tokenizer for training and test data

Otherwise, the same word will map to different tokens in each dataset. Each tokenizer has an internal dictionary (word_index) that is created by fit_on_texts, with indices assigned by word frequency.

It's not guaranteed that train and test data will contain the same words with the same frequencies, so each dataset would produce a different dictionary, and all results on the test data would be wrong.

This also means that you cannot call fit_on_texts, train, and then call fit_on_texts again: the second call changes the internal dictionary.
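To see why two separately fitted tokenizers disagree, here is a simplified stand-in for the dictionary that fit_on_texts builds. build_word_index is a hypothetical helper, written on the assumption that indices start at 1 and are assigned by descending word frequency; it is not the Keras code itself.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an index, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return {word: index for index, word in enumerate(ranked, start=1)}


train_texts = ['hey you there', 'how are you', 'i am fine thanks', 'hello there']
test_texts = ['he is fine', 'i am fine too']

train_index = build_word_index(train_texts)
test_index = build_word_index(test_texts)

# The same word gets a different index depending on which data was fitted.
print(train_index['fine'], test_index['fine'])
```

A model trained on sequences built from train_index would read the test index for 'fine' as a completely different word, which is why a single fitted tokenizer should be reused for both splits.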

It's possible to fit on the entire data set. But it's probably a better idea to reserve a token for "unknown" words (by setting oov_token), for cases where new test data contains words your model has never seen. For the model to learn what this token means, you also need to replace rare words in the training data with it.

As @Fernando H mentioned, it is probably better to fit the tokenizer only on the training data. Even then, you must reserve an OOV token and make sure it appears in the training data, because the model must learn what to do with it.

Testing the tokenizer with unknown words:

The following test shows that the tokenizer completely ignores unknown words when oov_token is not set. This might not be a good idea. Unknown words may be key words in sentences and simply ignoring them might be worse than knowing there is something unknown there.

# In TensorFlow 2.x the same class lives in tensorflow.keras.preprocessing.text
from keras.preprocessing.text import Tokenizer

training = ['hey you there', 'how are you', 'i am fine thanks', 'hello there']
test = ['he is fine', 'i am fine too']

# No oov_token: words not seen during fit_on_texts are silently dropped
tokenizer = Tokenizer()
tokenizer.fit_on_texts(training)

print(tokenizer.texts_to_sequences(training))
print(tokenizer.texts_to_sequences(test))

Outputs:

[[3, 1, 2], [4, 5, 1], [6, 7, 8, 9], [10, 2]]
[[8], [6, 7, 8]]

Now, this shows that the tokenizer assigns index 1 to all unknown words when oov_token is set:

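# With oov_token set, unknown words map to a reserved index instead of being dropped;
# a descriptive string is the usual choice for the marker
tokenizer2 = Tokenizer(oov_token='<OOV>')
tokenizer2.fit_on_texts(training)
print(tokenizer2.texts_to_sequences(training))
print(tokenizer2.texts_to_sequences(test))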

Outputs:

[[4, 2, 3], [5, 6, 2], [7, 8, 9, 10], [11, 3]]
[[1, 1, 9], [7, 8, 9, 1]]

It may also be worth replacing a group of rare words in the training data with index 1, so that your model learns how to deal with unknown words.
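One way to do that replacement by hand is sketched below in plain Python. fit_vocab, encode and min_count are hypothetical names, and the threshold of 2 is arbitrary. As far as I know, the Tokenizer's num_words argument combined with oov_token has a similar effect, but check that against the version you use.

```python
from collections import Counter

OOV_INDEX = 1  # index reserved for unknown and rare words


def fit_vocab(texts, min_count=2):
    """Keep only words seen at least min_count times; indices start at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index for index, word in enumerate(kept, start=OOV_INDEX + 1)}


def encode(texts, word_index):
    """Replace every word that is not in word_index with OOV_INDEX."""
    return [[word_index.get(word, OOV_INDEX) for word in text.lower().split()]
            for text in texts]


training = ['hey you there', 'how are you', 'i am fine thanks', 'hello there']
test = ['he is fine', 'i am fine too']

word_index = fit_vocab(training, min_count=2)  # fitted on training data only
train_seq = encode(training, word_index)
test_seq = encode(test, word_index)
print(train_seq)
print(test_seq)
```

With such a tiny corpus almost every word is rare, so most positions become 1. On real data only the long tail of infrequent words would be replaced, which gives the model enough examples of index 1 during training.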

Tags: python, tokenize, keras

The Dodo

Daniel Möller
