I came across the following statement when trying to find the difference between train_test_split and StratifiedShuffleSplit.
When stratify is not None, train_test_split uses StratifiedShuffleSplit internally.
I was just wondering why the StratifiedShuffleSplit from sklearn.model_selection is used when we can use the stratify argument available in train_test_split.
Mainly, it is done for the sake of reusability. Rather than duplicating the code already implemented for StratifiedShuffleSplit, train_test_split simply calls that class.
For the same reason, when stratify is None, it uses the model_selection.ShuffleSplit class (see source code).
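To see the delegation in practice, here is a minimal sketch (the toy data and random_state are my own assumptions) showing that passing stratify=y to train_test_split and calling StratifiedShuffleSplit directly both preserve the class proportions in the test fold:

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit

# Toy imbalanced dataset: 8 samples of class 0, 4 of class 1 (assumption).
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

# Convenience wrapper: stratify=y makes it use StratifiedShuffleSplit internally.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The same split done "by hand" with StratifiedShuffleSplit.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(sss.split(X, y))

# Both test folds keep the 2:1 class ratio (2 samples of class 0, 1 of class 1).
print(np.bincount(y_te))
print(np.bincount(y[test_idx]))
```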
Please note that duplicating code is considered a bad practice: it is assumed to inflate maintenance costs, and it is also considered defect-prone, as inconsistent changes to code duplicates can lead to unexpected behavior. Here is a reference if you'd like to learn more.
Besides, although they perform the same task, they cannot always be used in the same contexts. For example, train_test_split cannot be passed as the cv argument of sklearn.model_selection.RandomizedSearchCV or sklearn.model_selection.GridSearchCV, whereas StratifiedShuffleSplit can. The reason is that the former is not "an iterable yielding (train, test) splits as arrays of indices", while the latter has a split method that yields (train, test) splits as arrays of indices.
More info here (see parameter cv).
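As a sketch of that second point, a StratifiedShuffleSplit instance can be handed directly to GridSearchCV as cv, because its split method yields (train, test) index arrays. The synthetic dataset, estimator, and parameter grid below are my own illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

# A small synthetic binary classification problem (assumption for the demo).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# StratifiedShuffleSplit satisfies the cv contract: split() yields
# (train, test) index arrays, and the splits are stratified on y.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=cv,
)
search.fit(X, y)
print(search.best_params_)
```

Trying the same thing with train_test_split would fail, since it is a function returning already-split arrays rather than a splitter object with a split method.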