
How to standardize data with sklearn's cross_val_score()

Let's say I want to use a LinearSVC to perform k-fold cross-validation on a dataset. How would I perform standardization on the data?

The best practice I have read is to fit your standardization (scaler) on the training data only and then apply that fitted scaler to the testing data.

When one uses a simple train_test_split(), this is easy as we can just do:

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

clf = svm.LinearSVC()

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

How would one go about standardizing data while doing k-fold cross-validation? The problem is that every data point will be used for both training and testing, so you cannot standardize everything once up front before calling cross_val_score(). Wouldn't you need a separate standardization for each cross-validation split?
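In other words, something like this per-fold loop, where the scaler is refit on each fold's training portion (a minimal sketch of what I mean, assuming X and y are NumPy arrays and a plain KFold split; the n_splits value is just illustrative):

from sklearn import svm
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

cv = KFold(n_splits=5)
scores = []
for train_idx, test_idx in cv.split(X):
    # Refit the scaler on this fold's training portion only
    scaler = StandardScaler()
    X_train_fold = scaler.fit_transform(X[train_idx])
    X_test_fold = scaler.transform(X[test_idx])

    clf = svm.LinearSVC()
    clf.fit(X_train_fold, y[train_idx])
    scores.append(clf.score(X_test_fold, y[test_idx]))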

The docs do not mention standardization happening internally within the function. Am I SOL?

EDIT: This post is super helpful: Python - What is exactly sklearn.pipeline.Pipeline?

asked Jun 08 '17 by als5ev

People also ask

What does sklearn's cross_val_score() do?

The cross_val_score() function is used to perform the evaluation: it takes the estimator, the dataset, and the cross-validation configuration, and returns a list of scores, one calculated for each fold.

How is cross_val_score() calculated?

cross_val_score() splits the data into, say, 5 folds. For each fold it fits the model on the other 4 folds and scores it on the held-out fold. It then gives you the 5 scores, from which you can calculate a mean and variance for the score. You cross-validate to tune parameters and to get an estimate of the generalization score.


1 Answer

You can use a Pipeline to combine both steps (scaling and classification) and then pass the pipeline to cross_val_score().

When fit() is called on the pipeline, it fits the transforms one after the other, transforming the data as it goes, and then fits the final estimator on the transformed data. During predict() (only available if the last object in the pipeline is an estimator; otherwise use transform()) it applies the transforms to the data and predicts with the final estimator.

Like this:

from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
clf = svm.LinearSVC()

# Inside cross_val_score, the scaler is refit on the training portion of each fold
pipeline = Pipeline([('transformer', scaler), ('estimator', clf)])

cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)
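As a usage note, the fitted pipeline can also be used directly in place of the manual scaler-then-classifier steps from the question (a sketch assuming the same pipeline and an X_train/X_test split as in the question):

# fit() scales X_train with the StandardScaler, then fits LinearSVC on the result
pipeline.fit(X_train, y_train)

# predict() applies the already-fitted scaler to X_test, then predicts with LinearSVC
predicted = pipeline.predict(X_test)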

Check out the various examples of Pipeline to understand it better:

  • http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline

Feel free to ask if you have any doubts.

answered Oct 19 '22 by Vivek Kumar