I'm trying to understand how to use k-fold cross-validation from the sklearn Python module.
I understand the basic flow:
model = LogisticRegression()
model.fit(xtrain, ytrain)
model.predict(xtest)  # predict takes the test features, not the test labels
Where I'm confused is using sklearn's KFold with cross_val_score. As I understand it, the cross_val_score function will fit the model and predict on the k folds, giving you an accuracy score for each fold.
e.g. using code like this:
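from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5, shuffle=True, random_state=8)  # older releases: KFold(n=..., n_folds=5)
lr = linear_model.LogisticRegression()
accuracies = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=kf)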
So if I have a dataset with training and testing data, and I use the cross_val_score function with KFold to determine the accuracy of the algorithm on my training data for each fold, is the model now fitted and ready for prediction on the testing data?
That is, in the case above, can I now use lr.predict?
cross_val_score: Evaluate a score by cross-validation.
The cross_validate function differs from cross_val_score in two ways: It allows specifying multiple metrics for evaluation. It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
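As a rough illustration of the difference described above, here is a minimal sketch of cross_validate with two metrics. The synthetic data from make_classification, the choice of "f1" as the second metric, and max_iter=1000 are assumptions made for the example, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data standing in for the real training set (an assumption).
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=8)

kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = LogisticRegression(max_iter=1000)

# Several metrics at once, plus training scores and the per-fold fitted clones.
results = cross_validate(
    lr, X_train, y_train,
    scoring=["accuracy", "f1"],
    cv=kf,
    return_train_score=True,
    return_estimator=True,
)

print(sorted(results.keys()))      # fit/score times, test_* and train_* scores, estimator
print(results["test_accuracy"])    # one accuracy value per fold
print(len(results["estimator"]))   # one fitted clone per fold
```

Note that the estimators returned under the "estimator" key are the per-fold clones; the lr object passed in is still left unfitted.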
Cross-validation in your case would build k estimators (assuming k-fold CV), and you could then check the predictive power and variance of the technique on your data using the mean of the quality measure (the higher, the better) and the standard deviation of the quality measure.
cross_val_score splits the data into, say, 5 folds. Then, for each fold, it fits the model on the other 4 folds and scores it on the 5th. This gives you 5 scores, from which you can calculate a mean and a variance. You cross-validate to tune parameters and to get an estimate of the score.
No, the model is not fitted. Looking at the source code for cross_val_score:
scores = parallel(delayed(_fit_and_score)(clone(estimator), X, y, scorer, train, test, verbose, None, fit_params)
As you can see, cross_val_score clones the estimator before fitting the fold training data to it. cross_val_score returns an array of scores, which you can analyse to see how the estimator performs on different folds of the data and to check whether it overfits. You can read more about it here.
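To make the cloning behaviour concrete, the following is a hand-written approximation of what cross_val_score does for each fold. It is a sketch, not the library's actual implementation; the synthetic data and max_iter=1000 are assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the real training set (an assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=8)

kf = KFold(n_splits=5, shuffle=True, random_state=8)
lr = LogisticRegression(max_iter=1000)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    fold_model = clone(lr)                       # fresh, unfitted copy; lr itself is untouched
    fold_model.fit(X[train_idx], y[train_idx])   # fit on the other 4 folds
    preds = fold_model.predict(X[test_idx])      # predict on the held-out fold
    manual_scores.append(accuracy_score(y[test_idx], preds))
manual_scores = np.array(manual_scores)

# The library call should produce the same per-fold scores for the same splitter.
library_scores = cross_val_score(lr, X, y, scoring="accuracy", cv=kf)

print(manual_scores, library_scores)
print(f"mean={library_scores.mean():.3f}, std={library_scores.std():.3f}")
```

Because every fit happens on a clone, nothing learned inside the loop ever reaches lr. The mean and standard deviation printed at the end are the summary figures you would usually report.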
Once you are satisfied with the results of cross_val_score, you need to fit the estimator on the whole training data before you can use it to predict on test data.
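The workflow described in this answer could look roughly like the sketch below. The synthetic data, the train_test_split call, and max_iter=1000 are assumptions added so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data and split standing in for the real train/test sets (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8)

lr = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=8)

# Step 1: estimate performance with cross-validation on the training data only.
accuracies = cross_val_score(lr, X_train, y_train, scoring="accuracy", cv=kf)

# Step 2: lr is still unfitted, so predicting now raises NotFittedError.
try:
    lr.predict(X_test)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False
print("fitted after cross_val_score:", fitted_after_cv)

# Step 3: fit on the whole training set, then predict and score on the test set.
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
test_accuracy = lr.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

The cross-validated accuracies serve only as an estimate of how the model is likely to perform; the model used for the final predictions is the one fitted in step 3.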