sklearn cross_val_score gives lower accuracy than manual cross validation

I'm working on a text classification problem, which I've set up like so (I've left out the data processing steps for concision, but they'll produce a dataframe called data with columns X and y):

import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

sim = Pipeline([("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("rdf", RandomForestClassifier())])

Now I try to validate this model by training it on 2/3 of the data and scoring it on the remaining 1/3, like so:

train, test = ms.train_test_split(data, test_size=0.33)
sim.fit(train.X, train.y)
sim.score(test.X, test.y)
# 0.533333333333

I want to do this three times for three different test sets, but using cross_val_score gives me results that are much lower.

ms.cross_val_score(sim, data.X, data.y)
# [ 0.29264069  0.36729223  0.22977941]

As far as I know, each of the scores in that array should be produced by training on 2/3 of the data and scoring on the remaining 1/3 with the sim.score method. So why are they all so much lower?

Empiromancer asked Apr 28 '17


People also ask

What is the difference between cross_validate and cross_val_score?

The cross_validate function differs from cross_val_score in two ways: It allows specifying multiple metrics for evaluation. It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
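For instance, a minimal sketch (using a synthetic make_classification dataset as a stand-in for real data) showing the dict that cross_validate returns:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=100, random_state=0)

# cross_validate returns a dict of arrays, one entry per requested metric
results = cross_validate(RandomForestClassifier(random_state=0), X, y,
                         scoring=["accuracy", "f1"], return_train_score=True)
print(sorted(results.keys()))
# ['fit_time', 'score_time', 'test_accuracy', 'test_f1', 'train_accuracy', 'train_f1']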

What does sklearn cross_val_score do?

The cross_val_score() function performs the evaluation: it takes the estimator, the dataset, and a cross-validation configuration, and returns a list of scores, one per fold.
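A short sketch of passing an explicit cross-validation configuration (the dataset and estimator here are placeholders, not from the original post):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

# 5 shuffled folds; cross_val_score returns one score per fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(scores, scores.mean())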

Does cross_val_score train the model?

Can I train my model using cross_val_score? A common question developers have is whether cross_val_score can also function as a way of training the final model. Unfortunately, this is not the case: cross_val_score is a way of assessing a model and its parameters, and cannot be used for final training.
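The usual pattern, then (a sketch, not part of the original post), is to use cross_val_score only for the estimate and refit on all of the data afterwards:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, random_state=0)
clf = RandomForestClassifier(random_state=0)

print(cross_val_score(clf, X, y, cv=5).mean())  # generalization estimate only

# the models fitted inside cross_val_score are discarded,
# so refit on the full dataset to obtain the final model
clf.fit(X, y)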

How is cross_val_score calculated?

"cross_val_score" splits the data into say 5 folds. Then for each fold it fits the data on 4 folds and scores the 5th fold. Then it gives you the 5 scores from which you can calculate a mean and variance for the score. You crossval to tune parameters and get an estimate of the score.


1 Answer

I solved this problem in the process of writing my question, so here goes:

The default behavior for cross_val_score is to use KFold or StratifiedKFold to define the folds. By default, both have argument shuffle=False, so the folds are not pulled randomly from the data:

import numpy as np
import sklearn.model_selection as ms

for i, j in ms.KFold(n_splits=3).split(np.arange(9)):  # n_splits defaulted to 3 at the time
    print("TRAIN:", i, "TEST:", j)
TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]

My raw data was arranged by label, so with this default behavior I was trying to predict a lot of labels I hadn't seen in the training data. This is even more pronounced if I force use of KFold (I was doing classification, so StratifiedKFold was the default):

ms.cross_val_score(sim, data.X, data.y, cv=ms.KFold())
# array([ 0.05530776,  0.05709188,  0.025     ])
ms.cross_val_score(sim, data.X, data.y, cv=ms.StratifiedKFold(shuffle=False))
# array([ 0.2978355 ,  0.35924933,  0.27205882])
ms.cross_val_score(sim, data.X, data.y, cv=ms.KFold(shuffle=True))
# array([ 0.51561106,  0.50579839,  0.51785714])
ms.cross_val_score(sim, data.X, data.y, cv=ms.StratifiedKFold(shuffle=True))
# array([ 0.52869565,  0.54423592,  0.55626715])

Doing things by hand gave me higher scores because train_test_split shuffles by default, so it was doing the same thing as KFold(shuffle=True).
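If you want cross_val_score to reproduce that manual procedure directly, one option (a sketch using the sim pipeline and data from above) is to pass a ShuffleSplit, which draws a fresh random 1/3 test set on each repetition, just like repeated train_test_split calls:

# three random 2/3-train / 1/3-test splits
cv = ms.ShuffleSplit(n_splits=3, test_size=0.33, random_state=0)
ms.cross_val_score(sim, data.X, data.y, cv=cv)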

Empiromancer answered Oct 05 '22