I would like to use k-fold cross validation while learning a model. So far I am doing it like this:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)

# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)

# 5-fold cross-validation on the training set
scores = cross_val_score(model, X_train, y_train, cv=5)
At this step I am not quite sure whether I should call model.fit() or not, because in the official scikit-learn documentation they do not fit the model but just call cross_val_score as follows (they do not even split the data into training and test sets):
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
I would also like to tune the hyperparameters of the model while learning it. What is the right pipeline?
Cross-validation is repeated model fitting. Each fit is done on a large portion of the data and tested on the portion that was left out during fitting; this is repeated until every observation has been used for testing.
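As a rough illustration of that repeated fitting, here is a minimal sketch using KFold directly, assuming X_train and y_train from your split are NumPy arrays (with pandas objects you would index via .iloc):

from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X_train):
    fold_model = MultinomialNB()
    # fit on k-1 folds ...
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    # ... and evaluate on the fold that was left out
    fold_scores.append(fold_model.score(X_train[test_idx], y_train[test_idx]))

This is essentially what cross_val_score(model, X_train, y_train, cv=5) does for you in one call (for classifiers and an integer cv it uses stratified folds by default).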
Multiple model comparison is also called cross-model validation. Here "model" refers to completely different algorithms: the idea is to build several models from the same training data and validate them on the same verification data in order to compare how the different models perform.
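For instance, a minimal sketch of comparing two different algorithms on the same training data (MultinomialNB and a linear SVC are just illustrative choices):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

for name, candidate in [('MultinomialNB', MultinomialNB()),
                        ('linear SVC', SVC(kernel='linear', C=1))]:
    scores = cross_val_score(candidate, X_train, y_train, cv=5)
    print(name, scores.mean(), scores.std())

With an integer cv and no shuffling the folds are deterministic, so both candidates are scored on the same splits.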
The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset:

>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1, random_state=42)
>>> scores = cross_val_score(clf, X, y, cv=5)
>>> scores
array([0.96..., 1. , ...])
Cross-validation is mainly used as a way to check for over-fitting. Assuming you have determined the optimal hyperparameters of your classification technique (let's assume random forest for now), you would then want to see whether the model generalizes well across different test sets.
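For the hyperparameter-tuning part of your question, one common approach is GridSearchCV, which runs the cross-validation for every parameter combination on the training data. Here is a minimal sketch for your MultinomialNB model; the alpha values are only placeholder candidates:

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

param_grid = {'alpha': [0.1, 0.5, 1.0]}  # placeholder candidate values
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_train, y_train)            # cross-validation happens inside fit
print(search.best_params_, search.best_score_)
best_model = search.best_estimator_     # refit on the whole training set by default

The held-out test set from train_test_split is then used only once, at the very end, to estimate the performance of best_model.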
Your second example is right for doing the cross-validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics

The fitting is done inside the cross_val_score function; you don't need to worry about it beforehand.

[Edited] If, besides cross-validation, you also want a trained model, you can call model.fit() afterwards.
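Putting that together, a minimal sketch of the overall pipeline might look like this (cross_val_score clones the estimator internally, so the model object you pass in is left unfitted):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

# cross-validated estimate of performance on the training data
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean(), scores.std())

# train the final model on the full training set ...
model.fit(X_train, y_train)
# ... and evaluate it once on the held-out test set
print(model.score(X_test, y_test))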