 

Cross-validation in sklearn: do I need to call fit() as well as cross_val_score()?

I would like to use k-fold cross-validation while training a model. So far I am doing it like this:

# imports needed for this snippet
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# splitting dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)

# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)

At this step I am not quite sure whether I should call model.fit() or not, because in the official scikit-learn documentation they do not fit the model; they just call cross_val_score as follows (they do not even split the data into training and test sets):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
iris = load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

I would also like to tune the hyperparameters of the model while training it. What is the right pipeline?

asked May 14 '18 by torayeff

People also ask

Does cross-validation fit the model?

Cross-validation is repeated model fitting. Each fit is done on a (major) portion of the data and is tested on the portion that was left out during fitting. This is repeated until every observation has been used for testing.
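To make that concrete, here is a simplified sketch of the loop that cross_val_score runs internally (the real function additionally clones the estimator for each fold, and uses stratified splits for classifiers); the iris data and MultinomialNB are just placeholders:

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)
model = MultinomialNB()

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])    # fresh fit on each training fold
    y_pred = model.predict(X[test_idx])      # predictions on the held-out fold
    scores.append(accuracy_score(y[test_idx], y_pred))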

Does cross-validation fit multiple models?

Multiple-model comparison is also called cross-model validation. Here "model" refers to completely different algorithms. The idea is to train multiple models on the same training dataset and validate them on the same verification dataset, in order to compare the performance of the different models.
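As an illustration, a minimal sketch comparing two different algorithms on the same 5-fold splits (the two estimators are arbitrary placeholders):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# cv=5 uses the same (unshuffled, stratified) splits for each classifier,
# so the scores are directly comparable
for model in (GaussianNB(), svm.SVC(kernel='linear', C=1)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())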

How do you cross-validate with sklearn?

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset:

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1. , 0.96..., 0.96..., 1. ])

Does cross_val_predict fit the model?

Cross-validation is mainly used as a way to check for overfitting. Assuming you have determined the optimal hyperparameters of your classification technique (let's assume random forest for now), you would then want to see whether the model generalizes well across different test sets.
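For reference, cross_val_predict does fit internally (once per fold, like cross_val_score), but it returns the out-of-fold predictions instead of scores. A minimal sketch with placeholder data and estimator:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# each sample is predicted by a model fitted on folds that did not contain it
y_pred = cross_val_predict(GaussianNB(), X, y, cv=5)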


1 Answer

Your second example is the right way to do cross-validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics

The fitting is done inside the cross_val_score function; you don't need to worry about it beforehand.
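A minimal sketch of that behaviour, using the iris data as a stand-in for the asker's dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)
model = MultinomialNB()

# cross_val_score clones `model` and fits one clone per fold;
# the `model` object itself is left unfitted by this call
scores = cross_val_score(model, X, y, cv=5)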

[Edited] If, besides cross-validation, you want to train a final model, you can call model.fit() afterwards.
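For the hyperparameter-tuning part of the question, a common pattern is GridSearchCV, which runs the cross-validation over a parameter grid and then refits the best setting for you. A minimal sketch, again with iris standing in for the real data and with alpha (the MultinomialNB smoothing parameter) tuned over assumed example values:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=4222)

# fits one model per (alpha value, fold) on the training set only,
# then refits the best alpha on the full training set
search = GridSearchCV(MultinomialNB(), param_grid={'alpha': [0.1, 0.5, 1.0]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))  # final check on the held-out test set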

answered Oct 20 '22 by markus-hinsche