Preprocessing before or after train/test split?

I am using this excellent article to learn Machine learning.

https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/

The author splits the data first, then fits the tokenizer on the training portion only:

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)

If I tokenize the text before calling the train_test_split function, I can save a few lines of code:

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X)

X_t = tokenizer.texts_to_sequences(X)
vocab_size = len(tokenizer.word_index) + 1
maxlen = 200

X = pad_sequences(X_t, padding="post", maxlen=maxlen)

I just want to confirm that my approach is correct and I do not expect any surprises later in the script.

shantanuo asked Aug 28 '19

2 Answers

Both approaches will work in practice, but fitting the tokenizer on the training set and then applying it to both the training and test sets is better than fitting it on the whole dataset. With the first method you mimic the fact that words unseen by the model will appear at some point after you deploy it, so your model evaluation will be closer to what happens in a production environment.
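To illustrate the point, here is a minimal pure-Python stand-in for a word-index tokenizer (not the Keras API itself; the texts and helper names are made up for the example). When the index is built from training text only, test-only words simply have no entry, which is exactly the situation a deployed model faces:

```python
train_texts = ["the cat sat", "the dog ran"]
test_texts = ["the zebra ran"]  # "zebra" never appears in training data

def fit_word_index(texts):
    # Assign each new word the next integer, starting at 1 (as Keras does)
    index = {}
    for text in texts:
        for word in text.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def to_sequence(text, index):
    # Unknown words are dropped, mirroring Tokenizer's default behaviour
    return [index[w] for w in text.split() if w in index]

word_index = fit_word_index(train_texts)
print(to_sequence(test_texts[0], word_index))  # [1, 5] -- "zebra" is skipped
```

Evaluating on sequences produced this way gives a more honest estimate, since production input will also contain words the tokenizer never saw.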

Simon Delecourt answered Oct 01 '22

Agreed with @desertnaut's comment that the question is better suited for "Cross Validated", you'll get a better response there. But I'd still like to make a remark.

TL;DR: Don't do it. It's generally not a good idea to cross-contaminate your training and test sets, and it's not statistically correct to do so.

Tokenizer.fit_on_texts(texts) builds the word index, i.e. a mapping from words to integers that turns any sequence of words into a vector representation. The vocabularies of the training and test sets may differ: some words in the test set will not be present in the word index if the tokenizer was fit only on the training data. As a result, the same test sample can be mapped to a different vector depending on whether the tokenizer was fit on the training set alone or on the whole dataset.

Since the test set in a learning problem is supposed to stay hidden, using it during any part of model training is statistically incorrect.
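The leakage can be made concrete with plain Python (the sentences below are invented for the example; this is not the Keras API). Fitting on the full dataset puts words in the vocabulary that the model could only "know" by peeking at the test set:

```python
train = ["the cat sat on the mat", "dogs chase cats"]
test = ["a zebra escaped the zoo"]

train_vocab = {w for s in train for w in s.split()}
full_vocab = {w for s in train + test for w in s.split()}

# Words indexed only because the tokenizer saw the test set:
leaked = full_vocab - train_vocab
print(sorted(leaked))  # ['a', 'escaped', 'zebra', 'zoo']
```

Every word in `leaked` would get a real index under the fit-on-everything approach, even though the model has no business knowing about it at training time.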

Minato answered Oct 01 '22