Unexpected cross-validation scores with scikit-learn LinearRegression

Question

I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:

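from sklearn import cross_validation, linear_model

# Hold out 20% of the rows as a test set (train_test_split shuffles first)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    X, y,
    test_size=0.2, random_state=0)

model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)  # R^2 on the held-out rows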

Which yields:

0.797144744766

Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:

model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores

And I get output like this:

[ 0.04614495 -0.26160081 -3.11299397 -0.7326256  -1.04164369]

How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.

I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.

I am using sklearn version 0.16.1.

Aniket Schneider · Accepted Answer

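It turns out that my data was ordered in blocks of different classes. When cv is an integer and the estimator is a regressor, cross_validation.cross_val_score uses KFold without shuffling, so each fold is a block of consecutive rows, whereas train_test_split shuffles the rows before splitting. With block-ordered data, a consecutive test fold can be dominated by classes that are barely represented in the training folds, which is how R^2 can drop below zero. I was able to solve this by specifying that the cross-validation should use shuffled splits: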

model = linear_model.LinearRegression()
# Shuffle the row indices before dividing them into 5 folds
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores

Which gives:

[ 0.79714474  0.86636341  0.79665689  0.8036737   0.6874571 ]

This is in line with what I would expect.
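
For anyone on a newer scikit-learn: the cross_validation module was deprecated in 0.18 and removed in 0.20, and the same tools now live in sklearn.model_selection, where KFold takes n_splits instead of the number of samples and n_folds. Below is a minimal sketch of the effect described above. The data is a synthetic stand-in (rows sorted by a single feature, with a curved target), which is my assumption for illustration and not the data from the question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, ordered data (an assumption for illustration only):
# rows are sorted by x, and the target is not linear in x.
x = np.linspace(0.0, 1.0, 500)
X = x.reshape(-1, 1)
y = x ** 2

model = LinearRegression()

# Consecutive folds: each test fold is a contiguous range of x, so the
# model has to predict a region it never saw during training.
consecutive = KFold(n_splits=5)
unshuffled_scores = cross_val_score(model, X, y, cv=consecutive)

# Shuffled folds: every test fold is a random sample of the whole range.
shuffled = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(model, X, y, cv=shuffled)

print(unshuffled_scores)  # R^2 per consecutive fold
print(shuffled_scores)    # R^2 per shuffled fold
```

On this toy data the consecutive folds should score far worse on average than the shuffled ones, which is the same pattern as in the question. Shuffling is only appropriate when the row order carries no meaning; for time series or grouped data, a splitter such as TimeSeriesSplit or GroupKFold is probably the better choice.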

Tags: python, python-2.7, scikit-learn
