I am doing text classification based on a TF-IDF Vector Space Model. I have no more than 3000 samples. For a fair evaluation, I'm evaluating the classifier using 5-fold cross-validation. But what confuses me is whether it is necessary to rebuild the TF-IDF Vector Space Model in each fold of the cross-validation. Namely, would I need to rebuild the vocabulary and recalculate the IDF value of each term in the vocabulary in every fold?
Currently I'm doing the TF-IDF transformation with the scikit-learn toolkit and training my classifier with an SVM. My method is as follows: first, I split the samples in hand by a ratio of 3:1, so 75 percent of them are used to fit the parameters of the TF-IDF Vector Space Model; here, the parameters are the size of the vocabulary, the terms contained in it, and the IDF value of each term in the vocabulary. Then I transform the remaining samples in this TF-IDF Vector Space Model and use the resulting vectors for 5-fold cross-validation (notably, I don't use the previous 75 percent of samples in the cross-validation).
My code is as follows:
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# train/test split; the train data is used only to fit TfidfVectorizer()
x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)
tfidf = TfidfVectorizer()
tfidf.fit(x_train)
# vectorize the test data for 5-fold cross-validation
x_test = tfidf.transform(x_test)
scoring = ['accuracy']
clf = SVC(kernel='linear')
scores = cross_validate(clf, x_test, y_test, scoring=scoring, cv=5, return_train_score=False)
print(scores)
My confusion is whether my method of doing the TF-IDF transformation and the 5-fold cross-validation is correct, or whether it is necessary to rebuild the TF-IDF Vector Space Model on the training data of each fold and then transform both the training and test data into TF-IDF vectors, as follows:
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_index, test_index in skf.split(data_x, data_y):
    x_train, x_test = data_x[train_index], data_x[test_index]
    y_train, y_test = data_y[train_index], data_y[test_index]
    # fit the vectorizer on the training fold only, then transform both folds
    tfidf = TfidfVectorizer()
    x_train = tfidf.fit_transform(x_train)
    x_test = tfidf.transform(x_test)
    clf = SVC(kernel='linear')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    score = accuracy_score(y_test, y_pred)
    print(score)
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. It simply repeats the cross-validation procedure multiple times and reports the mean result across all folds from all runs.
Repeated k-fold is a widely preferred cross-validation technique for both classification and regression models.
Iterated k-fold cross-validation (a.k.a. repeated k-fold cross-validation) repeats the standard k-fold procedure a chosen number of times (e.g. 100 times), so you end up with 100 average scores, each the result of one full k-fold run.
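As a sketch of how that could look here (assuming the same data_x / data_y text samples and labels from the question, and a pipeline so the vectorizer is refit inside every fold), scikit-learn's RepeatedStratifiedKFold can be passed directly as the cv argument:
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# 5 folds repeated 10 times -> 50 accuracy scores; report their mean and spread
clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=rkf)
print(scores['test_accuracy'].mean(), scores['test_accuracy'].std())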
The StratifiedKFold approach, which you adopted to build the TfidfVectorizer(), is the right way; by doing so you make sure that features are generated only from the training dataset.
If you build the TfidfVectorizer() on the whole dataset, that is a situation of leaking the test dataset into the model, even though you are not explicitly feeding the test dataset to the classifier. Parameters such as the size of the vocabulary and the IDF value of each term in the vocabulary would differ considerably when the test documents are included.
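To see the leakage concretely, one rough check (assuming the same data_x and the 75/25 split from the question) is to fit one vectorizer on the training split only and another on the full dataset, then compare them; the vocabularies and IDF values will not match, and the difference is exactly the information taken from the test documents:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data_x, data_y, train_size=0.75, random_state=0)

# vectorizer fit on the training split only vs. one fit on the whole dataset
tfidf_train_only = TfidfVectorizer().fit(x_train)
tfidf_full = TfidfVectorizer().fit(data_x)

print(len(tfidf_train_only.vocabulary_), len(tfidf_full.vocabulary_))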
The simpler way is to use a pipeline together with cross_validate. Use this:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn import svm

clf = make_pipeline(TfidfVectorizer(), svm.SVC(kernel='linear'))
scores = cross_validate(clf, data_x, data_y, scoring=['accuracy'], cv=5, return_train_score=False)
print(scores)
Note: It is not useful to run cross_validate on the test data alone; it has to be done on the [train + validation] dataset.
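One way to read that note (a sketch, assuming you can spare a separate hold-out set from the ~3000 samples): keep a test set aside, run cross_validate only on the remaining train + validation portion, and score the untouched hold-out set once at the end.
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# hold out a final test set; cross-validate only on the train + validation part
x_trainval, x_holdout, y_trainval, y_holdout = train_test_split(
    data_x, data_y, train_size=0.8, stratify=data_y, random_state=0)

clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear'))
scores = cross_validate(clf, x_trainval, y_trainval, scoring=['accuracy'], cv=5)
print(scores['test_accuracy'].mean())

# fit once on the full train + validation portion, then score the hold-out set
clf.fit(x_trainval, y_trainval)
print(clf.score(x_holdout, y_holdout))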