Scikit - Combining scale and grid search

Tags: python, scikit-learn, cross-validation, grid-search

I am new to scikit-learn and have two small issues combining data scaling with a grid search.

1. Efficient scaler

Considering cross-validation with K folds, I would like the data scaler (preprocessing.StandardScaler(), for instance) to be fit only on the K-1 training folds each time the model is trained, and then applied to the remaining fold.

My impression is that the following code fits the scaler on the entire dataset, so I would like to modify it to behave as described above:

classifier = svm.SVC(C=1)    
clf = make_pipeline(preprocessing.StandardScaler(), classifier)
tuned_parameters = [{'C': [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)

2. Retrieve inner scaler fitting

When refit=True, the model is refit with the best parameters on the entire dataset after the grid search. My understanding is that the pipeline is used again, and therefore the scaler is fit on the entire dataset. Ideally, I would like to reuse that fitted scaler to scale my 'test' dataset. Is there a way to retrieve it directly from the GridSearchCV object?

asked Dec 03 '15 04:12

cpeusteuche

1 Answer

1. GridSearchCV knows nothing about the Pipeline object. It treats the provided estimator as atomic: it cannot pick out a particular stage (StandardScaler, for example) and fit different stages on different data. All GridSearchCV does is call fit(X, y) on a clone of the provided estimator, where X, y are the training part of each split. Every stage, including the scaler, is therefore fit on the same training folds and never sees the held-out fold during fitting.

2. Try this:

best_pipeline = my_grid_search.best_estimator_
best_scaler = best_pipeline.named_steps["standardscaler"]

3. When you wrap transformers and estimators in a Pipeline, each parameter name needs the step name as a prefix, e.g. tuned_parameters = [{'svc__C': [1, 10, 100, 1000]}]. For more details, see the scikit-learn examples "Concatenating multiple feature extraction methods" and "Pipelining: chaining a PCA and a logistic regression".
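
The scaler lookup and the prefixed parameter name above can be combined into one minimal sketch. The iris dataset, the train/test split, and the names X_train, X_test, manual_pred and auto_pred are assumptions made for illustration; they are not from the question.

```python
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# Assumption: iris stands in for the asker's data.
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# make_pipeline names each step after its lowercased class name:
# "standardscaler" and "svc".
clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# Parameters are addressed as <step name>__<parameter name>.
tuned_parameters = [{"svc__C": [1, 10, 100, 1000]}]
my_grid_search = GridSearchCV(clf, tuned_parameters, cv=5)  # refit=True by default
my_grid_search.fit(X_train, y_train)

# The refit pipeline holds a scaler fitted on all of X_train.
best_pipeline = my_grid_search.best_estimator_
best_scaler = best_pipeline.named_steps["standardscaler"]

# Scaling the test set by hand with the retrieved scaler...
X_test_scaled = best_scaler.transform(X_test)
manual_pred = best_pipeline.named_steps["svc"].predict(X_test_scaled)

# ...should give the same predictions as letting the grid search object do it.
auto_pred = my_grid_search.predict(X_test)
```

Since my_grid_search.predict(X_test) already sends the data through the refit pipeline, retrieving the scaler is mostly useful for inspection or for reusing the scaled features elsewhere.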

The GridSearchCV documentation may also help.
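
The claim that the scaler only sees the training folds can be checked empirically. The sketch below swaps in cross_validate for GridSearchCV, because cross_validate can hand back the pipeline fitted on each split (return_estimator=True); like the grid search, it clones the estimator and fits it on the training part of every split. The iris data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets, preprocessing, svm
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline

X, y = datasets.load_iris(return_X_y=True)  # assumption: placeholder data
pipe = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))

# A fixed integer random_state makes cv.split() yield the same folds on every call.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# return_estimator=True keeps the pipeline fitted on each training split.
result = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

full_mean = X.mean(axis=0)
fold_matches_train = []
fold_matches_full = []
for (train_idx, _), fitted in zip(cv.split(X, y), result["estimator"]):
    scaler = fitted.named_steps["standardscaler"]
    # Compare the learned mean with the mean of the training fold only...
    fold_matches_train.append(np.allclose(scaler.mean_, X[train_idx].mean(axis=0)))
    # ...and with the mean of the whole dataset.
    fold_matches_full.append(np.allclose(scaler.mean_, full_mean))

print("matches training-fold mean:", fold_matches_train)
print("matches full-data mean:    ", fold_matches_full)
```

If each scaler's mean_ matches its own training fold rather than the full dataset, the original pipeline needs no change beyond the parameter prefix.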

answered Sep 21 '22 14:09

Ibraim Ganiev
