Running locally in a Jupyter notebook and using the MNIST dataset (28k entries, 28x28 pixels per image), the following takes 27 seconds:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=1)
knn_clf.fit(pixels, labels)
However, the following takes 1722 seconds, in other words ~64 times longer:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(knn_clf, pixels, labels, cv=3, n_jobs=1)
My naive understanding is that cross_val_predict with cv=3 is doing 3-fold cross-validation, so I'd expect it to fit the model 3 times and therefore take at least ~3 times longer, but I don't see why it would take 64x!
To check whether it was something specific to my environment, I ran the same code in a Colab notebook. The difference was less extreme (15x), but still way above the ~3x I expected.
What am I missing? Why is cross_val_predict so much slower than just fitting the model?
In case it matters, I'm running scikit-learn 0.20.2.
cross_val_score returns the score of each test fold, whereas cross_val_predict returns the predicted y values for each test fold. With cross_val_score() you typically use the average of the fold scores, which is affected by the number of folds, because some folds may have a high error (the model may not fit them well).
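To make that difference concrete, here is a small sketch reusing knn_clf, pixels and labels from the question (not anything new from the library): cross_val_score returns one score per fold, while cross_val_predict returns one out-of-fold prediction per sample.

from sklearn.model_selection import cross_val_score, cross_val_predict

# One accuracy score per fold -> array of shape (3,)
scores = cross_val_score(knn_clf, pixels, labels, cv=3)
print(scores, scores.mean())

# One out-of-fold prediction per sample -> array of shape (n_samples,)
y_pred = cross_val_predict(knn_clf, pixels, labels, cv=3)
print(y_pred.shape)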
Generate cross-validated estimates for each input data point. The data is split according to the cv parameter. Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.
Can I train my model using cross_val_score? A common question developers have is whether cross_val_score can also function as a way of training the final model. Unfortunately, this is not the case: cross_val_score is a way of assessing a model and its parameters, and cannot be used for training the final model.
The function cross_val_predict has a similar interface to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set.
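As a rough sketch (not the library's exact implementation), cross_val_predict for a classifier with cv=3 behaves approximately like the following manual loop, assuming pixels and labels are NumPy arrays as in the question:

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# Approximate, simplified version of cross_val_predict(knn_clf, pixels, labels, cv=3)
cv = StratifiedKFold(n_splits=3)
y_pred = np.empty_like(labels)
for train_idx, test_idx in cv.split(pixels, labels):
    model = clone(knn_clf)                               # fresh copy of the estimator for each fold
    model.fit(pixels[train_idx], labels[train_idx])      # cheap for KNN: it only stores the data
    y_pred[test_idx] = model.predict(pixels[test_idx])   # expensive for KNN: nearest-neighbour search

Every sample ends up in exactly one test fold, so predict is run over the entire dataset, fold by fold.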
KNN is also called a lazy algorithm because during fitting it does nothing but save the input data; there is no learning at all. The actual distance calculation happens during predict, for each test data point. Hence, when using cross_val_predict, KNN has to predict on all the validation data points, which is what makes the computation time so much higher!
cross_val_predict does a fit and a predict, so it might take longer than just fitting, but I did not expect it to be 64 times longer.