I'm working on a text classification problem, which I've set up like so (I've left out the data processing steps for concision, but they'll produce a dataframe called data with columns X and y):
import sklearn.model_selection as ms
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

sim = Pipeline([("vec", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
                ("rdf", RandomForestClassifier())])
Now I try to validate this model by training it on 2/3 of the data and scoring it on the remaining 1/3, like so:
train, test = ms.train_test_split(data, test_size = 0.33)
sim.fit(train.X, train.y)
sim.score(test.X, test.y)
# 0.533333333333
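(For reference, since sim ends in a classifier, sim.score returns mean accuracy on the held-out data; a minimal sketch of the equivalent computation, reusing the names from the snippet above:)
from sklearn.metrics import accuracy_score

# score() for a classifier pipeline is mean accuracy, i.e. equivalent to:
accuracy_score(test.y, sim.predict(test.X))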
I want to do this three times for three different test sets, but using cross_val_score gives me results that are much lower.
ms.cross_val_score(sim, data.X, data.y)
# [ 0.29264069 0.36729223 0.22977941]
As far as I know, each of the scores in that array should be produced by training on 2/3 of the data and scoring on the remaining 1/3 with the sim.score method. So why are they all so much lower?
I solved this problem in the process of writing my question, so here it goes:
The default behavior for cross_val_score is to use KFold or StratifiedKFold to define the folds. By default, both have the argument shuffle=False, so the folds are not pulled randomly from the data:
import numpy as np
import sklearn.model_selection as ms
for i, j in ms.KFold().split(np.arange(9)):
    print("TRAIN:", i, "TEST:", j)
TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
My raw data was arranged by label, so with this default behavior I was trying to predict a lot of labels I hadn't seen in the training data. This is even more pronounced if I force the use of KFold (I was doing classification, so StratifiedKFold was the default):
ms.cross_val_score(sim, data.X, data.y, cv = ms.KFold())
# array([ 0.05530776, 0.05709188, 0.025 ])
ms.cross_val_score(sim, data.X, data.y, cv = ms.StratifiedKFold(shuffle = False))
# array([ 0.2978355 , 0.35924933, 0.27205882])
ms.cross_val_score(sim, data.X, data.y, cv = ms.KFold(shuffle = True))
# array([ 0.51561106, 0.50579839, 0.51785714])
ms.cross_val_score(sim, data.X, data.y, cv = ms.StratifiedKFold(shuffle = True))
# array([ 0.52869565, 0.54423592, 0.55626715])
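To make the "unseen labels" problem concrete, here is a small sketch on made-up labels (the labels array below is hypothetical, not my data): with rows sorted by label and no shuffling, every test fold contains a class that never appears in its training fold.
import numpy as np
import sklearn.model_selection as ms

# Hypothetical toy labels, sorted by class the way my raw data was
labels = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3)
for train_idx, test_idx in ms.KFold(n_splits=3).split(labels):
    print("TRAIN labels:", sorted(set(labels[train_idx].tolist())),
          "TEST labels:", sorted(set(labels[test_idx].tolist())))
# TRAIN labels: ['b', 'c'] TEST labels: ['a']
# TRAIN labels: ['a', 'c'] TEST labels: ['b']
# TRAIN labels: ['a', 'b'] TEST labels: ['c']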
Doing things by hand was giving me higher scores because train_test_split was doing the same thing as KFold(shuffle = True).
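As an aside, if the goal is literally "repeat my manual 2/3–1/3 split three times", a minimal sketch (my suggestion, reusing sim and data from above) is to pass a ShuffleSplit as cv, which draws independent random test sets of the requested size; StratifiedShuffleSplit does the same while preserving class proportions.
# Mimic train_test_split(test_size=0.33) three times inside cross_val_score
cv = ms.ShuffleSplit(n_splits=3, test_size=0.33)
ms.cross_val_score(sim, data.X, data.y, cv=cv)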