
How to standardize data with sklearn's cross_val_score()

Let's say I want to use a LinearSVC to perform k-fold cross-validation on a dataset. How would I perform standardization on the data?

The best practice I have read is to fit your standardization (scaler) on the training data only and then apply that fitted scaler to the testing data.

When one uses a simple train_test_split(), this is easy as we can just do:

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

clf = svm.LinearSVC()

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

How would one go about standardizing data while doing k-fold cross-validation? The problem is that every data point will be used for both training and testing, so you cannot standardize everything once up front before calling cross_val_score(). Wouldn't you need a separate standardization for each cross-validation split?
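In other words, something like this per-fold loop, where the scaler is refit on each fold's training portion (a minimal sketch of what I mean, assuming X and y are NumPy arrays and a plain KFold split; the n_splits value is just illustrative):

from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

cv = KFold(n_splits=5)
scores = []
for train_idx, test_idx in cv.split(X):
    # Refit the scaler on this fold's training portion only
    scaler = StandardScaler()
    X_train_fold = scaler.fit_transform(X[train_idx])
    X_test_fold = scaler.transform(X[test_idx])

    clf = svm.LinearSVC()
    clf.fit(X_train_fold, y[train_idx])
    scores.append(clf.score(X_test_fold, y[test_idx]))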

The docs do not mention standardization happening internally within the function. Am I SOL?

EDIT: This post is super helpful: Python - What is exactly sklearn.pipeline.Pipeline?

asked Jun 08 '17 by als5ev

People also ask

What does sklearn's cross_val_score() do?

The cross_val_score() function is used to perform the evaluation: it takes the estimator, the dataset, and the cross-validation configuration, and returns a list of scores, one calculated for each fold.

How is cross_val_score() calculated?

cross_val_score() splits the data into, say, 5 folds. For each fold it fits the model on the other 4 folds and scores it on the held-out fold. It then gives you the 5 scores, from which you can calculate a mean and variance for the score. You cross-validate to tune parameters and to get an estimate of the generalization score.


1 Answer

You can use a Pipeline to combine both steps (scaling and classification) and then pass the pipeline to cross_val_score().

When fit() is called on the pipeline, it fits the transforms one after the other, transforming the data as it goes, and then fits the final estimator on the transformed data. During predict() (only available if the last object in the pipeline is an estimator; otherwise use transform()) it applies the transforms to the data and predicts with the final estimator.

Like this:

from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
clf = svm.LinearSVC()

# Inside cross_val_score, the scaler is refit on the training portion of each fold
pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])

cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)
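As a usage note, the fitted pipeline can also be used directly in place of the manual scaler-then-classifier steps from the question (a sketch assuming the same pipeline and an X_train/X_test split as in the question):

# fit() scales X_train with the StandardScaler, then fits LinearSVC on the result
pipeline.fit(X_train, y_train)

# predict() applies the already-fitted scaler to X_test, then predicts with LinearSVC
predicted = pipeline.predict(X_test)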

Check out the various examples of Pipeline to understand it better:

  • http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline

Feel free to ask if you have any doubts.

answered Oct 19 '22 by Vivek Kumar