I am trying to use the TimeSeriesSplit cross-validation strategy in sklearn version 0.18.1 with a LogisticRegression estimator. I get an error stating that:
cross_val_predict only works for partitions
The following snippet reproduces the problem:
from sklearn import linear_model, neighbors
from sklearn.model_selection import train_test_split, cross_val_predict, TimeSeriesSplit, KFold, cross_val_score
import pandas as pd
import numpy as np
from datetime import date, datetime
df = pd.DataFrame(data=np.random.randint(0,10,(100,5)), index=pd.date_range(start=date.today(), periods=100), columns='x1 x2 x3 x4 y'.split())
X, y = df['x1 x2 x3 x4'.split()], df['y']
score = cross_val_score(linear_model.LogisticRegression(fit_intercept=True), X, y, cv=TimeSeriesSplit(n_splits=2))
y_hat = cross_val_predict(linear_model.LogisticRegression(fit_intercept=True), X, y, cv=TimeSeriesSplit(n_splits=2), method='predict_proba')
What am I doing wrong?
There are several ways to pass the cv argument to cross_val_score. One of them is to pass the generator of splits. For example
y = range(14)
cv = TimeSeriesSplit(n_splits=2).split(y)
gives a generator. With this you can generate the CV train and test index arrays. The first looks like this:
print(next(cv))
(array([0, 1, 2, 3, 4, 5]), array([6, 7, 8, 9]))
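To see every split at once rather than just the first, the generator can be collected into a list. The sketch below is only an illustration (the names `splits`, `tested` and `never_tested` are mine, not part of sklearn); it also checks which indices never land in a test fold, which is relevant to the "only works for partitions" message, since TimeSeriesSplit always trains on the earliest block and never tests on it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(14)

# Materialise the generator so the splits can be inspected more than once.
splits = list(TimeSeriesSplit(n_splits=2).split(y))

for train_idx, test_idx in splits:
    # Training indices always precede the test indices in time order.
    print(train_idx, test_idx)

# Indices that appear in at least one test fold.
tested = np.concatenate([test_idx for _, test_idx in splits])

# Indices that are only ever used for training (the earliest samples).
never_tested = np.setdiff1d(np.arange(len(y)), tested)
print(never_tested)
```

Because `never_tested` is not empty, the test folds do not cover the whole dataset, so they do not form a partition of it.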
You can also pass a DataFrame to split.
df = pd.DataFrame(data=np.random.randint(0,10,(100,5)),
index=pd.date_range(start=date.today(),
periods=100), columns='x1 x2 x3 x4 y'.split())
cv = TimeSeriesSplit(n_splits=2).split(df)
print(next(cv))
(array([ 0, 1, 2, ..., 31, 32, 33]), array([34, 35, 36, ..., 64, 65, 66]))
In your case this should work for cross_val_score:
score = cross_val_score(linear_model.LogisticRegression(fit_intercept=True),
X, y, cv=TimeSeriesSplit(n_splits=2).split(df))
Have a look at cross_val_score and TimeSeriesSplit for details.
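As for cross_val_predict itself: as far as I can tell, it requires every sample to appear in exactly one test fold, and TimeSeriesSplit never tests on the first block of samples, so the error is raised regardless of how cv is passed. One possible workaround, sketched below under my own assumptions (fixed seed and start date for reproducibility, the `proba` frame and `max_iter=1000` are illustrative choices, not anything from the question), is to loop over the splits manually and leave the untested rows as NaN.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Same shape of data as in the question, with a fixed seed and start date.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0, 10, (100, 5)),
                  index=pd.date_range("2017-01-01", periods=100),
                  columns="x1 x2 x3 x4 y".split())
X, y = df[["x1", "x2", "x3", "x4"]], df["y"]

classes = np.sort(y.unique())
# NaN marks rows that never fall into a test fold.
proba = pd.DataFrame(np.nan, index=df.index, columns=classes)

for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(X):
    clf = LogisticRegression(fit_intercept=True, max_iter=1000)
    clf.fit(X.iloc[train_idx], y.iloc[train_idx])
    fold = pd.DataFrame(clf.predict_proba(X.iloc[test_idx]),
                        index=df.index[test_idx], columns=clf.classes_)
    # Classes missing from this training fold get probability 0.
    proba.iloc[test_idx] = fold.reindex(columns=classes, fill_value=0.0).values
```

After the loop, `proba` holds out-of-sample class probabilities for every row that was in a test fold, and NaN for the initial training-only block.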