
Scikit-Learn: Test Size in TimeSeriesSplit

Tags:

scikit-learn

I am using scikit-learn's TimeSeriesSplit to split my time series data into training and testing sets. Currently the test set is about 50% of the rows used in the first split, about 33% in the second, and 25% in the third. I want a fixed 10% of the data to be used as the testing set.

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print(train_index, test_index)

Output is:

[   0    1    2 ..., 1067 1068 1069] [1070 1071 1072 ..., 2136 2137 2138]
[   0    1    2 ..., 2136 2137 2138] [2139 2140 2141 ..., 3205 3206 3207]
[   0    1    2 ..., 3205 3206 3207] [3208 3209 3210 ..., 4274 4275 4276]

I would like something like this: tscv = TimeSeriesSplit(n_splits=3, test_size=0.1), similar to train_test_split.

How can I hold out only 10% of the entries for testing?

suku asked Jan 05 '23 04:01


2 Answers

There is no direct parameter for specifying the test percentage, but you can choose n_splits to get the desired result.

The documentation says:

In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.

You want the last 10% as the test set and the rest as the training set, so use n_splits=9. The data is then divided into 10 folds, and in the last iteration of the for loop the first 9 folds are the training set and the last fold is the test set.
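To make the fold arithmetic concrete, here is a rough pure-Python sketch of how the split boundaries come out. It is not the library's code: the function name is made up, and it assumes that each fold has n_samples // (n_splits + 1) rows and that any leftover rows go into the first training set, which is consistent with the output shown in the question.

```python
def time_series_fold_bounds(n_samples, n_splits):
    """Return a list of (test_start, test_end) pairs, end-exclusive.

    For each pair, the training rows are assumed to be 0 .. test_start - 1.
    """
    n_folds = n_splits + 1
    test_size = n_samples // n_folds
    # Assumption: leftover rows (n_samples % n_folds) join the first training set.
    first_test_start = test_size + n_samples % n_folds
    return [
        (test_start, test_start + test_size)
        for test_start in range(first_test_start, n_samples, test_size)
    ]


# 4277 rows, as in the question's output
for n_splits in (3, 9):
    test_start, test_end = time_series_fold_bounds(4277, n_splits)[-1]
    share = (test_end - test_start) / 4277
    print(n_splits, test_start, test_end, round(share, 3))
```

With n_splits=3 the boundaries match the question's printed indices (the last test set is rows 3208 to 4276, a quarter of the data). With n_splits=9 the last test set is rows 3850 to 4276, which is just under 10% of the data.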

So change your code accordingly:

test_size = 0.1

# Inverse of the conversion in the TimeSeriesSplit source (n_folds = n_splits + 1):
# 10 folds of 10% each means n_splits = 9

n_splits = int(round(1 / test_size)) - 1   # must be an int; 1//test_size gives the float 9.0

tscv = TimeSeriesSplit(n_splits=n_splits)
for train_index, test_index in tscv.split(X):
    print(train_index, test_index)

    # See the notes below about the following lines
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

If you use X_train, X_test, etc. inside the for loop, the test set stays at roughly 0.1 of the data in every iteration, while the training set grows with each split (in a time series, only the values that precede the test indices can be used for training).

If you use them only after the for loop has finished, they hold the last split: a single train/test pair with roughly 0.9 train and 0.1 test.
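If only that final 0.9/0.1 split is needed, one way to get it without writing the loop is to materialise the splits and take the last one. This is a small sketch with made-up placeholder data (the X and y below are not the asker's data); it assumes X and y are NumPy arrays so that they can be indexed with the index arrays.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data for illustration only: 50 rows, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

tscv = TimeSeriesSplit(n_splits=9)

# split() yields the splits in chronological order, so the last
# one has the largest training set and the final fold as test set.
train_index, test_index = list(tscv.split(X))[-1]

X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]

print(len(train_index), len(test_index))  # 45 5
```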

EDIT: I can't say why they chose the (k+1)th fold as the test set; please have a look at the explanation in the user guide. In the source code, though, they use a test_size calculated from n_splits:

n_samples = _num_samples(X)
n_splits = self.n_splits
n_folds = n_splits + 1
test_size = (n_samples // n_folds)

So a future version may expose test_size as a parameter. Hope this helps. Feel free to comment here if anything is unclear.
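A note for readers on a newer scikit-learn: as far as I know, version 0.24 and later accept a test_size argument in TimeSeriesSplit. It is an integer number of samples rather than a fraction, so the 10% has to be converted by hand. A minimal sketch with placeholder data (the X below is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data for illustration only: 100 rows, 2 features
X = np.arange(200).reshape(100, 2)

# test_size is a row count, not a fraction: take 10% of the rows
test_size = len(X) // 10

tscv = TimeSeriesSplit(n_splits=3, test_size=test_size)

sizes = []
for train_index, test_index in tscv.split(X):
    sizes.append((len(train_index), len(test_index)))
    print(len(train_index), len(test_index))
```

Every split then has a 10-row test set taken from the end of the series, and only the training set grows from split to split. On versions older than 0.24 this call should fail with an unexpected keyword argument error, in which case the n_splits approach above still applies.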

Vivek Kumar answered Jan 06 '23 17:01


Does this get you what you want? It is a single train/test split with the last 10% of rows as the test set.

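# X is assumed to be a pandas DataFrame whose rows are ordered by time
train_rows = round(0.9 * X.shape[0])

# iloc slices by position, so this also works when the index is not 0..n-1
X_train = X.iloc[:train_rows, :]
X_test  = X.iloc[train_rows:, :]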

assert X_train.shape[0] + X_test.shape[0] == X.shape[0]

Max Power answered Jan 06 '23 18:01