I'm working on a text classification problem, which I've set up like so (I've left out the data processing steps for concision, but they'll produce a dataframe called data with columns X and y):
import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

sim = Pipeline([("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("rdf", RandomForestClassifier())])
Now I try to validate this model by training it on 2/3 of the data and scoring it on the remaining 1/3, like so:
train, test = ms.train_test_split(data, test_size = 0.33)
sim.fit(train.X, train.y)
sim.score(test.X, test.y)
# 0.533333333333
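(For reference, since sim ends in a classifier, sim.score returns mean accuracy on the held-out data; a minimal sketch of the equivalent computation, reusing the names from the snippet above:)
from sklearn.metrics import accuracy_score

# score() for a classifier pipeline is mean accuracy, i.e. equivalent to:
accuracy_score(test.y, sim.predict(test.X))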
I want to do this three times for three different test sets, but using cross_val_score gives me results that are much lower.
ms.cross_val_score(sim, data.X, data.y)
# [ 0.29264069 0.36729223 0.22977941]
As far as I know, each of the scores in that array should be produced by training on 2/3 of the data and scoring on the remaining 1/3 with the sim.score method. So why are they all so much lower?
I solved this problem in the process of writing my question, so here it goes:
The default behavior for cross_val_score is to use KFold or StratifiedKFold to define the folds. By default, both have the argument shuffle=False, so the folds are not pulled randomly from the data:
import numpy as np
import sklearn.model_selection as ms
for i, j in ms.KFold().split(np.arange(9)):
    print("TRAIN:", i, "TEST:", j)
TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
My raw data was arranged by label, so with this default behavior I was trying to predict a lot of labels I hadn't seen in the training data. This is even more pronounced if I force the use of KFold (I was doing classification, so StratifiedKFold was the default):
ms.cross_val_score(sim, data.X, data.y, cv = ms.KFold())
# array([ 0.05530776, 0.05709188, 0.025 ])
ms.cross_val_score(sim, data.X, data.y, cv = ms.StratifiedKFold(shuffle = False))
# array([ 0.2978355 , 0.35924933, 0.27205882])
ms.cross_val_score(sim, data.X, data.y, cv = ms.KFold(shuffle = True))
# array([ 0.51561106, 0.50579839, 0.51785714])
ms.cross_val_score(sim, data.X, data.y, cv = ms.StratifiedKFold(shuffle = True))
# array([ 0.52869565, 0.54423592, 0.55626715])
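To make the "unseen labels" problem concrete, here is a small sketch on made-up labels (the labels array below is hypothetical, not my data): with rows sorted by label and no shuffling, every test fold contains a class that never appears in its training fold.
import numpy as np
import sklearn.model_selection as ms

# Hypothetical toy labels, sorted by class the way my raw data was
labels = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3)
for train_idx, test_idx in ms.KFold(n_splits=3).split(labels):
    print("TRAIN labels:", sorted(set(labels[train_idx].tolist())),
          "TEST labels:", sorted(set(labels[test_idx].tolist())))
# TRAIN labels: ['b', 'c'] TEST labels: ['a']
# TRAIN labels: ['a', 'c'] TEST labels: ['b']
# TRAIN labels: ['a', 'b'] TEST labels: ['c']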
Doing things by hand was giving me higher scores because train_test_split was doing the same thing as KFold(shuffle = True).
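As an aside, if the goal is literally "repeat my manual 2/3–1/3 split three times", a minimal sketch (my suggestion, reusing sim and data from above) is to pass a ShuffleSplit as cv, which draws independent random test sets of the requested size; StratifiedShuffleSplit does the same while preserving class proportions.
# Mimic train_test_split(test_size=0.33) three times inside cross_val_score
cv = ms.ShuffleSplit(n_splits=3, test_size=0.33)
ms.cross_val_score(sim, data.X, data.y, cv=cv)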