
Why is cross_val_predict so much slower than fit for KNeighborsClassifier?

Running locally in a Jupyter notebook and using the MNIST dataset (28k entries, 28x28 pixels per image), the following takes 27 seconds.

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_jobs=1)
knn_clf.fit(pixels, labels)

However, the following takes 1722 seconds, in other words ~64 times longer:

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(knn_clf, pixels, labels, cv = 3, n_jobs=1)

My naive understanding is that cross_val_predict with cv=3 is doing 3-fold cross validation, so I'd expect it to fit the model 3 times, and so take at least ~3 times longer, but I don't see why it would take 64x!

To check if it was something specific to my environment, I ran the same in a Colab notebook - the difference was less extreme (15x), but still way above the ~3x I expected.

What am I missing? Why is cross_val_predict so much slower than just fitting the model?

In case it matters, I'm running scikit-learn 0.20.2.

Dave Cahill asked Jan 22 '19 09:01



2 Answers

KNN is often called a lazy algorithm because fitting does little more than store the input data (at most it builds an index over it); no actual learning happens at that stage.

The actual distance calculation happens during predict, for each test data point. With cross_val_predict, KNN has to predict on every validation data point in each fold, which is what drives the computation time up.
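To see this on your own machine, you could time fit and predict separately. This is only a rough sketch: the random `X` and `y` arrays are stand-ins I made up (784 features like a flattened 28x28 image, but far fewer rows than in the question), so the absolute numbers will not match the ones above.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data: random values, 784 features per row.
rng = np.random.default_rng(0)
X = rng.random((2000, 784))
y = rng.integers(0, 10, size=2000)

knn_clf = KNeighborsClassifier(n_jobs=1)

# Time the fit step on its own.
start = time.perf_counter()
knn_clf.fit(X, y)
fit_seconds = time.perf_counter() - start

# Time the predict step on its own; this is where each query row
# has to be compared against the stored training rows.
start = time.perf_counter()
y_pred = knn_clf.predict(X)
predict_seconds = time.perf_counter() - start

print(f"fit: {fit_seconds:.4f}s  predict: {predict_seconds:.4f}s")
```

If the explanation above holds for your setup, the predict timing should dominate, and the gap should widen as the number of rows grows.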

Venkatachalam answered Nov 15 '22 03:11


cross_val_predict does a fit and a predict for each fold, so it might take longer than just fitting, but I did not expect it to take 64 times longer.
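For what it's worth, here is roughly what I understand cross_val_predict to be doing with cv=3 on a classifier, written out as a manual loop. Treat it as a sketch under assumptions: the random `X` and `y` are made-up stand-in data, and I'm relying on the documented behaviour that an integer cv with a classifier uses StratifiedKFold without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data, kept small so the loop runs quickly.
rng = np.random.default_rng(0)
X = rng.random((600, 20))
y = rng.integers(0, 3, size=600)

knn_clf = KNeighborsClassifier(n_jobs=1)

# One fit and one predict per fold: train on two thirds of the rows,
# predict the held-out third, and store those predictions in place.
manual_pred = np.empty_like(y)
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    fold_clf = clone(knn_clf)
    fold_clf.fit(X[train_idx], y[train_idx])
    manual_pred[test_idx] = fold_clf.predict(X[test_idx])

# The library call, for comparison with the manual loop.
library_pred = cross_val_predict(knn_clf, X, y, cv=3, n_jobs=1)

print("same predictions:", np.array_equal(manual_pred, library_pred))
```

So across the three folds every row gets predicted once, each time against the other two thirds of the data. For an estimator whose fit is cheap and whose predict is expensive, comparing against fit alone is not a like-for-like baseline.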

Louis D. answered Nov 15 '22 04:11