My X is as follows:
Unique ID   Exp start date   Value   Status
001         01/01/2020       4000    Closed
001         12/01/2019       4000    Archived
002         01/01/2020       5000    Closed
002         12/01/2019       5000    Archived
I want to make sure that none of the unique IDs that appear in the training set are also included in the test set. I am using sklearn's train_test_split. Is this possible?
I believe you need GroupShuffleSplit from sklearn.model_selection.
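import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Dummy features and targets: 8 samples
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
# One group label per sample; rows sharing a label always stay together
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])
print(groups.shape)

# train_size is applied to the number of groups, not the number of rows
gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)
for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)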
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
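As the output shows, the train/test indices are created based on the groups variable: every row of a given group lands on the same side of the split, so no group appears in both train and test.
In your case, the Unique ID column should be used as groups.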
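Applied to the table in the question, this could look roughly like the sketch below. It assumes the data lives in a pandas DataFrame; the column names (unique_id, exp_start_date, value, status) and the 50/50 split are illustrative choices, not something given in the question.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame mirroring the table in the question; column names are assumed.
df = pd.DataFrame({
    "unique_id": ["001", "001", "002", "002"],
    "exp_start_date": ["01/01/2020", "12/01/2019", "01/01/2020", "12/01/2019"],
    "value": [4000, 4000, 5000, 5000],
    "status": ["Closed", "Archived", "Closed", "Archived"],
})

# A single split; test_size is a fraction of the groups (IDs), not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["unique_id"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Sanity check: no ID is shared between the two sets.
train_ids = set(train_df["unique_id"])
test_ids = set(test_df["unique_id"])
assert train_ids.isdisjoint(test_ids)
```

Because the split is made over IDs rather than rows, the resulting row proportions may differ from the requested fraction when IDs have different numbers of rows.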