I've been attempting to use weighted samples in scikit-learn while training a Random Forest classifier. It works well when I pass sample weights to the classifier directly, e.g. RandomForestClassifier().fit(X, y, sample_weight=weights), but when I tried a grid search to find better hyperparameters for the classifier, I hit a wall:
To pass the weights when using the grid search, the usage is:
grid_search = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1,
                           fit_params={"sample_weight": weights})
The problem is that the cross-validator isn't aware of the sample weights, so it doesn't resample them together with the actual data. Calling grid_search.fit(X, y) therefore fails: the cross-validator creates subsets of X and y, sub_X and sub_y, and eventually a classifier is called with classifier.fit(sub_X, sub_y, sample_weight=weights), but weights hasn't been resampled to match, so an exception is thrown.
For now I've worked around the issue by over-sampling high-weight samples before training the classifier, but that's only a temporary workaround. Any suggestions on how to proceed?
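For reference, the over-sampling workaround mentioned above could look something like this minimal sketch: repeat each row in proportion to its weight (the function name, the scale factor, and the toy data are all my own illustration, not from the question).

```python
import numpy as np

def oversample_by_weight(X, y, weights, scale=10):
    # Approximate sample weights by repeating each row roughly
    # `weight * scale` times (at least once), then training on the
    # enlarged, unweighted dataset.
    counts = np.maximum(1, np.round(np.asarray(weights) * scale)).astype(int)
    idx = np.repeat(np.arange(len(X)), counts)
    return X[idx], y[idx]

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 1, 1])
weights = np.array([0.1, 0.5, 1.0])
X_over, y_over = oversample_by_weight(X, y, weights)
# X_over/y_over now contain 1 + 5 + 10 = 16 rows
```

This blows up the dataset size and only approximates fractional weights, which is why it's a stopgap rather than a real solution.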
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset:
from sklearn.model_selection import cross_val_score
From the scikit-learn doc: The cross_validate function differs from cross_val_score in two ways: 1. It allows specifying multiple metrics for evaluation. 2. It returns a dict containing training scores, fit-times and score-times in addition to the test score.
print(cross_val_score(model, X_train, y_train, cv=5))
We pass the model or classifier object, the features, the labels, and the parameter cv, which indicates the K for K-fold cross-validation. The method returns an array of k scores, one per fold.
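Putting that together as a self-contained sketch (the synthetic dataset and the RandomForestClassifier settings are illustrative choices, not from the answer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data just to make the snippet runnable.
X, y = make_classification(n_samples=100, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
print(scores)
```

Note, however, that cross_val_score alone doesn't address the original question of slicing sample_weight per fold.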
I have too little reputation so I can't comment on @xenocyon's answer. I'm using sklearn 0.18.1, and I'm also using a pipeline in the code. The solution that worked for me was:
fit_params={'classifier__sample_weight': w}
where w is the weight vector and classifier is the step name in the pipeline.
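In full, that looks roughly like the sketch below: the step name ("classifier" here) prefixes the fit parameter, separated by a double underscore. In recent scikit-learn versions the fit_params constructor argument is gone, so the parameter is passed to fit() instead; the pipeline, parameter grid, and data are my own illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
w = np.ones(len(y))  # placeholder weight vector

pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", LogisticRegression())])
params = {"classifier__C": [0.1, 1.0]}

grid = GridSearchCV(pipe, params, cv=3)
# "classifier" is the pipeline step name; the double underscore routes
# the weights to that step's fit method.
grid.fit(X, y, classifier__sample_weight=w)
```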
Edit: the scores I see from the code below don't seem quite right. This is possibly because, as mentioned above, even when weights are used in fitting they might not be used in scoring.
It appears that this has been fixed now. I am running sklearn version 0.15.2. My code looks something like this:
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

model = SGDRegressor()
parameters = {'alpha': [0.01, 0.001, 0.0001]}
cv = GridSearchCV(model, parameters, fit_params={'sample_weight': weights})
cv.fit(X, y)
Hope that helps (you and others who see this post).
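For readers on current scikit-learn: the fit_params constructor argument was later removed from GridSearchCV, but fit parameters can be passed to fit() itself, and array-likes of length n_samples (such as sample_weight) are sliced per CV fold along with the data. A hedged sketch, with illustrative data and parameter grid of my own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, random_state=0)
weights = np.random.RandomState(0).rand(len(y)) + 0.5

params = {"n_estimators": [5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
# sample_weight is passed to fit() and sliced per fold together with X and y.
grid.fit(X, y, sample_weight=weights)
print(grid.best_params_)
```

Note that by default the weights are used for fitting but not for scoring the folds, which echoes the caveat in the edit above.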