scikit-learn: train_test_split, can I ensure the same splits on different datasets?

Question

I understand that train_test_split splits a dataset into random train and test subsets, and that passing an integer random_state ensures the same split on that dataset each time the method is called.

My problem is slightly different.

I have two datasets, A and B. They contain identical sets of examples, and the examples appear in the same order in each dataset. The key difference is that the examples in each dataset use a different set of features.

I would like to test whether the features used in A lead to better performance than the features used in B. So I would like to ensure that when I call train_test_split on A and B, I get the same splits on both datasets, so that the comparison is meaningful.

Is this possible? Do I simply need to ensure that random_state is the same in both method calls?

Thanks

eqzx · Accepted Answer

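Yes, passing the same random_state is enough, provided both datasets have the same number of rows in the same order and the other arguments (test_size, shuffle, stratify) are identical in both calls.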

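>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X2 = np.hstack((X, X))  # same rows as X, different feature columns
>>> X_train, X_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
>>> X_train2, X_test2, _, _ = train_test_split(X2, y, test_size=0.33, random_state=42)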
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_train2
array([[4, 5, 4, 5],
       [0, 1, 0, 1],
       [6, 7, 6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> X_test2
array([[2, 3, 2, 3],
       [8, 9, 8, 9]])
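
As an alternative to two calls with a matching random_state, train_test_split accepts any number of arrays of the same length and splits all of them with the same indices in a single call. A minimal sketch, assuming A and B are row-aligned arrays; the arrays below are hypothetical stand-ins for the two datasets in the question, not data from it:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for datasets A and B: the same 5 examples in the
# same order, but with different feature columns.
A = np.arange(10).reshape((5, 2))
B = np.hstack((A, A * 10))  # first two columns of B repeat A, for checking
y = np.arange(5)

# A single call applies one set of shuffled indices to every array passed in.
A_train, A_test, B_train, B_test, y_train, y_test = train_test_split(
    A, B, y, test_size=0.33, random_state=42
)

# Check that the rows of A and B ended up in the same subsets, in the same order.
same_rows = (
    np.array_equal(A_train, B_train[:, :2])
    and np.array_equal(A_test, B_test[:, :2])
)
print(same_rows)
```

With this form the splits cannot drift apart if one call's arguments are later changed and the other's are not.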

Tags:

scikit-learn

Ziqi
