
Difference between train_test_split and StratifiedShuffleSplit

I came across the following statement when trying to find the difference between train_test_split and StratifiedShuffleSplit:

"When stratify is not None, train_test_split uses StratifiedShuffleSplit internally."

Why would anyone use StratifiedShuffleSplit from sklearn.model_selection directly when the stratify argument of train_test_split is available?

skaarfacee asked Dec 18 '25 21:12



1 Answer

Mainly, it is done for the sake of reusability. Rather than duplicating the code already implemented in StratifiedShuffleSplit, train_test_split just calls that class. For the same reason, when stratify is None (and shuffle is left at its default of True), it uses the model_selection.ShuffleSplit class (see source code).
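To make the delegation concrete, here is a minimal sketch that compares the two routes on toy data. The arrays, the 80/20 class imbalance, and random_state=0 are illustrative assumptions, not anything from the question. With the same seed and test size, both routes are expected to select the same rows, since train_test_split takes the first split from the splitter class.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Toy data (assumed for illustration): 100 samples, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Route 1: the convenience function with stratify set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Route 2: the splitter class, asking for a single split of index arrays.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

# Compare the rows selected by each route.
same_split = np.array_equal(X_train, X[train_idx]) and np.array_equal(
    X_test, X[test_idx]
)
# Class counts in the test set; stratification keeps the 80/20 ratio.
test_class_counts = np.bincount(y_test)

print("same split:", same_split)
print("test class counts:", test_class_counts)
```

Note the difference in what comes back: train_test_split returns the split arrays themselves, while StratifiedShuffleSplit.split yields index arrays that you apply yourself.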

Please note that duplicating code is considered bad practice: it is assumed to inflate maintenance costs, and it is also defect-prone, because inconsistent changes to the duplicates can lead to unexpected behavior. Here is a reference if you'd like to learn more.

Besides, although they perform the same task, they cannot always be used in the same contexts. For example, train_test_split cannot be passed as the cv argument of a randomized or grid search with sklearn.model_selection.RandomizedSearchCV or sklearn.model_selection.GridSearchCV, whereas StratifiedShuffleSplit can. The reason is that train_test_split is a function that returns the split arrays once; it is not "an iterable yielding (train, test) splits as arrays of indices". A StratifiedShuffleSplit instance, on the other hand, has a split method that yields (train, test) splits as arrays of indices. More info here (see parameter cv).
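As a rough sketch of that second point, the splitter object can be handed directly to the cv parameter of GridSearchCV. The synthetic dataset, the LogisticRegression estimator, and the grid over C are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# Synthetic imbalanced data (assumed for illustration): roughly 80% / 20% classes.
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)

# Five stratified random splits, each holding out 20% of the samples.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# The splitter is passed as cv; the search calls cv.split(X, y) internally.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,
)
search.fit(X, y)

n_splits_used = search.n_splits_
best_c = search.best_params_["C"]
print("splits per candidate:", n_splits_used)
print("best C:", best_c)
```

There is nothing comparable to pass for train_test_split, since calling it produces one fixed set of arrays rather than something the search can iterate over.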

s.dallapalma answered Dec 20 '25 16:12



