
sample weights in scikit-learn broken in cross validation

I've been attempting to use weighted samples in scikit-learn while training a Random Forest classifier. It works well when I pass sample weights to the classifier directly, e.g. RandomForestClassifier().fit(X, y, sample_weight=weights), but when I tried a grid search to find better hyperparameters for the classifier, I hit a wall:

To pass the weights when using the grid search, the usage is:

grid_search = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1,
                           fit_params={"sample_weight": weights})

The problem is that the cross-validator isn't aware of the sample weights, so it doesn't resample them together with the actual data. Calling grid_search.fit(X, y) therefore fails: the cross-validator creates subsets sub_X and sub_y of X and y, and eventually a classifier is called with classifier.fit(sub_X, sub_y, sample_weight=weights), but weights hasn't been resampled, so an exception is thrown.
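Concretely, the lengths go out of sync. A toy sketch (with made-up data) of what happens inside the cross-validator:

```python
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
weights = np.linspace(0.1, 1.0, 10)

# A CV split hands the estimator only a subset of the rows...
train_idx = np.arange(8)
sub_X, sub_y = X[train_idx], y[train_idx]

# ...but weights is never subset alongside X and y, so the lengths
# disagree and classifier.fit(sub_X, sub_y, sample_weight=weights) raises.
print(len(sub_X), len(weights))  # 8 vs. 10
```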

For now I've worked around the issue by over-sampling high-weight samples before training the classifier, but it's a temporary work-around. Any suggestions on how to proceed?
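For reference, the work-around looks roughly like this (oversample_by_weight is a made-up helper name, and the integer repeat counts only approximate the real-valued weights):

```python
import numpy as np

def oversample_by_weight(X, y, weights, scale=10, seed=0):
    """Crudely emulate sample weights by repeating each row roughly
    in proportion to its weight. Not exact: weights are quantized."""
    rng = np.random.RandomState(seed)
    # Turn each weight into an integer repeat count (at least 1).
    counts = np.maximum(1, np.round(np.asarray(weights) * scale).astype(int))
    idx = np.repeat(np.arange(len(X)), counts)
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 1, 1])
w = np.array([0.1, 0.5, 1.0])
X_big, y_big = oversample_by_weight(X, y, w)  # rows repeated 1, 5, 10 times
```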

Roee Shenberg asked Feb 19 '14



2 Answers

I have too little reputation, so I can't comment on @xenocyon's answer. I'm using sklearn 0.18.1, and I'm also using a pipeline in my code. The solution that worked for me was:

fit_params={'classifier__sample_weight': w}, where w is the weight vector and classifier is the step name in the pipeline.
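Spelled out as a runnable sketch (made-up data and step names; note that in releases after 0.18, fit parameters are passed to fit() itself rather than to the GridSearchCV constructor, and per-sample arrays are then sliced to match each fold):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = rng.randint(0, 2, 60)
w = rng.rand(60)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SGDClassifier(random_state=0)),
])
grid = GridSearchCV(pipe, {"classifier__alpha": [1e-3, 1e-4]}, cv=3)
# The "classifier__" prefix routes sample_weight to that pipeline step;
# the weight array is sliced per fold along with X and y.
grid.fit(X, y, classifier__sample_weight=w)
```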

milonimrod answered Sep 17 '22


Edit: the scores I see below don't seem quite right. This is possibly because, as mentioned above, even when the weights are used in fitting, they might not be used in scoring.

It appears that this has been fixed now. I am running sklearn version 0.15.2. My code looks something like this:

from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

model = SGDRegressor()
parameters = {'alpha': [0.01, 0.001, 0.0001]}
cv = GridSearchCV(model, parameters, fit_params={'sample_weight': weights})
cv.fit(X, y)

Hope that helps (you and others who see this post).
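Update for newer scikit-learn releases, where the fit_params constructor argument was removed: fit parameters are passed to fit() directly, and array-likes with one entry per sample, such as sample_weight, appear to be sliced to match each fold. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = rng.randint(0, 2, 60)
weights = rng.rand(60)

params = {"n_estimators": [5, 10]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
# sample_weight goes to fit(), not the constructor; the cross-validator
# slices it per fold along with X and y.
grid_search.fit(X, y, sample_weight=weights)
print(grid_search.best_params_)
```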

xenocyon answered Sep 20 '22