The stratify parameter asks whether you want to retain the same proportion of classes in the train and test sets that are found in the entire original dataset. For example, if there are 100 observations in the entire original dataset of which 80 are class a and 20 are class b and you set stratify = True , with a .
Stratified sampling is super easy in Scikit-learn, just add stratify=feature_name parameter to the function. To prove this works, let's split the diamonds dataset both with vanilla splits and stratification. You can see how vastly stratification changes the performance of a model.
This stratify
parameter makes a split so that the proportion of values in the sample produced will be the same as the proportion of values provided to parameter stratify
.
For example, if variable y
is a binary categorical variable with values 0
and 1
and there are 25% of zeros and 75% of ones, stratify=y
will make sure that your random split has 25% of 0
's and 75% of 1
's.
For my future self who comes here via Google:
train_test_split
is now in model_selection
, hence:
from sklearn.model_selection import train_test_split
# given:
# features: xs
# ground truth: ys
x_train, x_test, y_train, y_test = train_test_split(xs, ys,
test_size=0.33,
random_state=0,
stratify=ys)
is the way to use it. Setting the random_state
is desirable for reproducibility.
Scikit-Learn is just telling you it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17 as indicated in the documentation you quoted.
So you just need to update Scikit-Learn.
In this context, stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset.
Try running this code, it "just works":
from sklearn import cross_validation, datasets
iris = datasets.load_iris()
X = iris.data[:,:2]
y = iris.target
x_train, x_test, y_train, y_test = cross_validation.train_test_split(X,y,train_size=.8, stratify=y)
y_test
array([0, 0, 0, 0, 2, 2, 1, 0, 1, 2, 2, 0, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 2,
1, 2, 1, 1, 0, 2, 1])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With