I've been attempting to use weighted samples in scikit-learn while training a Random Forest classifier. It works well when I pass sample weights to the classifier directly, e.g. RandomForestClassifier().fit(X, y, sample_weight=weights), but when I tried a grid search to find better hyperparameters for the classifier, I hit a wall:
To pass the weights when using the grid search, the usage is:
grid_search = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1,
                           fit_params={"sample_weight": weights})
The problem is that the cross-validator isn't aware of the sample weights, so it doesn't resample them together with the actual data. Calling grid_search.fit(X, y) therefore fails: the cross-validator creates subsets of X and y, sub_X and sub_y, and eventually a classifier is called with classifier.fit(sub_X, sub_y, sample_weight=weights), but weights hasn't been resampled to match, so an exception is thrown.
For now I've worked around the issue by over-sampling high-weight samples before training the classifier, but that's only a temporary workaround. Any suggestions on how to proceed?
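For reference, the over-sampling workaround mentioned above could look something like this minimal sketch: repeat each row in proportion to its weight (the function name, the scale factor, and the toy data are all my own illustration, not from the question).

```python
import numpy as np

def oversample_by_weight(X, y, weights, scale=10):
    # Approximate sample weights by repeating each row roughly
    # `weight * scale` times (at least once), then training on the
    # enlarged, unweighted dataset.
    counts = np.maximum(1, np.round(np.asarray(weights) * scale)).astype(int)
    idx = np.repeat(np.arange(len(X)), counts)
    return X[idx], y[idx]

X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0, 1, 1])
weights = np.array([0.1, 0.5, 1.0])
X_over, y_over = oversample_by_weight(X, y, weights)
# X_over/y_over now contain 1 + 5 + 10 = 16 rows
```

This blows up the dataset size and only approximates fractional weights, which is why it's a stopgap rather than a real solution.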
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset:
from sklearn.model_selection import cross_val_score
From the scikit-learn doc: The cross_validate function differs from cross_val_score in two ways: 1. It allows specifying multiple metrics for evaluation. 2. It returns a dict containing training scores, fit-times and score-times in addition to the test score.
print(cross_val_score(model, X_train, y_train, cv=5))
We pass the model or classifier object, the features, the labels, and the parameter cv, which indicates the K for K-fold cross-validation. The method returns an array of k scores, one per fold.
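Putting that together as a self-contained sketch (the synthetic dataset and the RandomForestClassifier settings are illustrative choices, not from the answer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data just to make the snippet runnable.
X, y = make_classification(n_samples=100, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # one accuracy score per fold
print(scores)
```

Note, however, that cross_val_score alone doesn't address the original question of slicing sample_weight per fold.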
I have too little reputation so I can't comment on @xenocyon's answer. I'm using sklearn 0.18.1, and I'm also using a pipeline in the code. The solution that worked for me was:
fit_params={'classifier__sample_weight': w}
where w is the weight vector and classifier is the step name in the pipeline.
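In full, that looks roughly like the sketch below: the step name ("classifier" here) prefixes the fit parameter, separated by a double underscore. In recent scikit-learn versions the fit_params constructor argument is gone, so the parameter is passed to fit() instead; the pipeline, parameter grid, and data are my own illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)
w = np.ones(len(y))  # placeholder weight vector

pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", LogisticRegression())])
params = {"classifier__C": [0.1, 1.0]}

grid = GridSearchCV(pipe, params, cv=3)
# "classifier" is the pipeline step name; the double underscore routes
# the weights to that step's fit method.
grid.fit(X, y, classifier__sample_weight=w)
```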
Edit: the scores I see from the code below don't seem quite right. This is possibly because, as mentioned above, even when weights are used in fitting they might not be used in scoring.
It appears that this has been fixed now. I am running sklearn version 0.15.2. My code looks something like this:
from sklearn.linear_model import SGDRegressor
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

model = SGDRegressor()
parameters = {'alpha': [0.01, 0.001, 0.0001]}
cv = GridSearchCV(model, parameters, fit_params={'sample_weight': weights})
cv.fit(X, y)
Hope that helps (you and others who see this post).
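For readers on current scikit-learn: the fit_params constructor argument was later removed from GridSearchCV, but fit parameters can be passed to fit() itself, and array-likes of length n_samples (such as sample_weight) are sliced per CV fold along with the data. A hedged sketch, with illustrative data and parameter grid of my own:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, random_state=0)
weights = np.random.RandomState(0).rand(len(y)) + 0.5

params = {"n_estimators": [5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), params, cv=3)
# sample_weight is passed to fit() and sliced per fold together with X and y.
grid.fit(X, y, sample_weight=weights)
print(grid.best_params_)
```

Note that by default the weights are used for fitting but not for scoring the folds, which echoes the caveat in the edit above.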