Do I use the same TF-IDF vocabulary in k-fold cross-validation?

I am doing text classification based on the TF-IDF vector space model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. What confuses me is whether it is necessary to rebuild the TF-IDF vector space model in each fold. Namely, do I need to rebuild the vocabulary and recalculate the IDF value of each term in every fold?

Currently I'm doing the TF-IDF transformation with the scikit-learn toolkit and training my classifier using an SVM. My method is as follows. First, I split the samples in hand by a ratio of 3:1, and 75 percent of them are used to fit the parameters of the TF-IDF vector space model. Here, the parameters are the size of the vocabulary, the terms it contains, and the IDF value of each term. Then I transform the remaining 25 percent with this TF-IDF model and use those vectors for 5-fold cross-validation (notably, I don't transform the previous 75 percent of the samples or use them for training).

My code is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

# train/test split; the train portion is used only to fit TfidfVectorizer()
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)

# vectorize the held-out 25% for 5-fold cross-validation
x_test = tfidf.transform(x_test)

scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)

My question is whether this way of doing the TF-IDF transformation and the 5-fold cross-validation is correct, or whether it is necessary to rebuild the TF-IDF vector space model from the training data in each fold and then transform both the training and test data with it, as follows:

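from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# data_x and data_y must be NumPy arrays for the index-based selection below
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)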
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]

    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)

    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)
lx.F asked Sep 02 '17 04:09




1 Answer

The StratifiedKFold approach, in which you fit TfidfVectorizer() inside each fold, is the right way. By doing so you make sure that the features are generated from the training data only.

If you build TfidfVectorizer() on the whole dataset, you leak information from the test data into the model, even though you never explicitly feed it the test documents. Parameters such as the size of the vocabulary and the IDF value of each term can differ noticeably when the test documents are included.
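To make the leakage concrete, here is a minimal sketch on a made-up four-document corpus (the documents and variable names are illustrative, not taken from the question). It fits one vectorizer on the training documents only and another on training plus test documents, then compares the vocabularies and the IDF of a shared term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, only for illustration
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
test_docs = ["the parrot sat on the perch"]

# Vectorizer fitted the leak-free way (training documents only)
tfidf_train = TfidfVectorizer().fit(train_docs)
# Vectorizer fitted on everything, which lets test documents shape the features
tfidf_all = TfidfVectorizer().fit(train_docs + test_docs)

# Terms that exist in the vocabulary only because the test document was seen
leaked_terms = sorted(set(tfidf_all.vocabulary_) - set(tfidf_train.vocabulary_))
print(leaked_terms)

# The IDF of a term present in both vocabularies also shifts
idf_train = tfidf_train.idf_[tfidf_train.vocabulary_["sat"]]
idf_all = tfidf_all.idf_[tfidf_all.vocabulary_["sat"]]
print(idf_train, idf_all)
```

With a few thousand real documents the shift per term is usually smaller than in this toy case, but the principle is the same: the test fold has influenced the feature space.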

A simpler way is to use a pipeline with cross_validate, which refits the vectorizer on the training part of each fold:

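from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# the pipeline is cloned and fitted on the training part of each fold
clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))

scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)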
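As a sanity check, the following sketch compares the pipeline with the manual StratifiedKFold loop from the question. The toy corpus is hypothetical and stands in for data_x and data_y. Passing the same splitter object to cross_validate should give the same per-fold accuracies as the loop.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for data_x / data_y
data_x = np.array(
    [f"good great film number {i}" for i in range(20)]
    + [f"bad awful film number {i}" for i in range(20)]
)
data_y = np.array([1] * 20 + [0] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: refit the vectorizer on the training part of every fold
manual_scores = []
for train_index, test_index in skf.split(data_x, data_y):
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(data_x[train_index])
    x_test = tfidf.transform(data_x[test_index])
    svc = SVC(kernel="linear").fit(x_train, data_y[train_index])
    manual_scores.append(accuracy_score(data_y[test_index], svc.predict(x_test)))

# Pipeline: cross_validate performs the same per-fold refit internally
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipeline_scores = cross_validate(
    pipeline, data_x, data_y, scoring=["accuracy"], cv=skf
)["test_accuracy"]

print(manual_scores)
print(pipeline_scores)
```

The pipeline version is shorter and harder to get wrong, since there is no place to accidentally call fit on the test fold.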

Note: It is not useful to run cross_validate on the test data alone. It has to be run on the [train + validation] dataset.
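If a separate held-out test set is still wanted, one possible arrangement is sketched below. It assumes a 75/25 split like the one in the question and a hypothetical toy corpus: cross-validate the pipeline on the 75 percent portion, then fit once on that portion and score on the untouched 25 percent. The names x_dev and y_dev are made up for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for data_x / data_y
data_x = [f"good great film number {i}" for i in range(20)] + [
    f"bad awful film number {i}" for i in range(20)
]
data_y = [1] * 20 + [0] * 20

# Hold out 25% as a final test set; cross-validate only on the rest
x_dev, x_test, y_dev, y_test = train_test_split(
    data_x, data_y, train_size=0.75, stratify=data_y, random_state=0
)

clf = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))

# 5-fold cross-validation on the [train + validation] portion
scores = cross_validate(clf, x_dev, y_dev, scoring=["accuracy"], cv=5)
print(scores["test_accuracy"].mean())

# Final check: fit on the whole development portion, score on the held-out set
clf.fit(x_dev, y_dev)
test_accuracy = clf.score(x_test, y_test)
print(test_accuracy)
```

With only about 3000 samples, running cross-validation on the whole dataset (as in the answer's snippet) is also a reasonable choice, because it uses every document for both training and evaluation.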

Venkatachalam answered Sep 19 '22 21:09