Preprocessing before or after train/test split?

I am using this excellent article to learn Machine learning.

https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/

The author splits the data first, then fits the tokenizer on the training portion only:

from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)

If I tokenize the text before calling the train_test_split function, I can save a few lines of code:

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X)

X_t = tokenizer.texts_to_sequences(X)
vocab_size = len(tokenizer.word_index) + 1
maxlen = 200

X = pad_sequences(X_t, padding="post", maxlen=maxlen)

I just want to confirm that my approach is correct and I do not expect any surprises later in the script.

shantanuo asked Aug 28 '19

2 Answers

Both approaches will work in practice, but fitting the tokenizer on the training set and then applying it to both the training and test sets is better than fitting it on the whole dataset. With the first method you mimic the fact that words unseen by the model will appear at some point after you deploy it, so your model evaluation will be closer to what happens in a production environment.
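To illustrate the point, here is a minimal pure-Python stand-in for a word-index tokenizer (not the Keras API itself; the texts and helper names are made up for the example). When the index is built from training text only, test-only words simply have no entry, which is exactly the situation a deployed model faces:

```python
train_texts = ["the cat sat", "the dog ran"]
test_texts = ["the zebra ran"]  # "zebra" never appears in training data

def fit_word_index(texts):
    # Assign each new word the next integer, starting at 1 (as Keras does)
    index = {}
    for text in texts:
        for word in text.split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def to_sequence(text, index):
    # Unknown words are dropped, mirroring Tokenizer's default behaviour
    return [index[w] for w in text.split() if w in index]

word_index = fit_word_index(train_texts)
print(to_sequence(test_texts[0], word_index))  # [1, 5] -- "zebra" is skipped
```

Evaluating on sequences produced this way gives a more honest estimate, since production input will also contain words the tokenizer never saw.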

Simon Delecourt answered Oct 01 '22

Agreed with @desertnaut's comment that the question is better suited for "Cross Validated", you'll get a better response there. But I'd still like to make a remark.

TL;DR: Don't do it. It's generally not a good idea to cross-contaminate your training and test sets, and it's not statistically correct to do so.

Tokenizer.fit_on_texts(texts) builds the word index, i.e. a mapping from words to integers that turns any sequence of words into a vector representation. The vocabularies of the training and test sets may differ: some words in the test set will not be present in the word index if the tokenizer was fit only on the training data. As a result, the same test sample can be mapped to a different vector depending on whether the tokenizer was fit on the training set alone or on the whole dataset.

Since the test set in a learning problem is supposed to stay hidden, using it during any part of model training is statistically incorrect.
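The leakage can be made concrete with plain Python (the sentences below are invented for the example; this is not the Keras API). Fitting on the full dataset puts words in the vocabulary that the model could only "know" by peeking at the test set:

```python
train = ["the cat sat on the mat", "dogs chase cats"]
test = ["a zebra escaped the zoo"]

train_vocab = {w for s in train for w in s.split()}
full_vocab = {w for s in train + test for w in s.split()}

# Words indexed only because the tokenizer saw the test set:
leaked = full_vocab - train_vocab
print(sorted(leaked))  # ['a', 'escaped', 'zebra', 'zoo']
```

Every word in `leaked` would get a real index under the fit-on-everything approach, even though the model has no business knowing about it at training time.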

Minato answered Oct 01 '22