
Sklearn online prediction, batch vs one by one

As stated in many places, for input data of size 10,000 it is much faster to predict the whole dataset in one batch than to predict each row one by one (in both cases, model.n_jobs=1).
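The gap is easy to reproduce. A minimal benchmark sketch (the model, data shapes, and row count are arbitrary choices for illustration):

```python
# Compare one vectorized predict() call against many single-row calls.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((1_000, 20))
y = rng.integers(0, 2, 1_000)

model = RandomForestClassifier(n_estimators=10, n_jobs=1, random_state=0)
model.fit(X, y)

# One call over the whole matrix.
t0 = time.perf_counter()
batch_preds = model.predict(X)
batch_time = time.perf_counter() - t0

# 1,000 separate single-row calls: same answers, far more per-call overhead.
t0 = time.perf_counter()
single_preds = np.array([model.predict(row.reshape(1, -1))[0] for row in X])
single_time = time.perf_counter() - t0

print(f"batch:      {batch_time:.3f}s")
print(f"one-by-one: {single_time:.3f}s")
```

The predictions are identical; only the per-call overhead (input validation, dispatch, array setup) differs.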

I know there are many overheads in the one-by-one approach. But in an online service, requests arrive one at a time, so it is hard to aggregate them first and then predict in batch.
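One common workaround (not from sklearn itself, just a generic server-side pattern) is micro-batching: queue incoming single requests and have a worker drain the queue and predict the accumulated rows in one vectorized call. A minimal sketch, with a hypothetical `predict_one` entry point:

```python
# Micro-batching sketch: single requests are queued; a background worker
# drains the queue and serves each accumulated group with one predict() call.
import queue
import threading
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
model = LogisticRegression().fit(rng.random((200, 4)), rng.integers(0, 2, 200))

requests = queue.Queue()  # items: (1xN feature row, per-request result dict)

def worker():
    while True:
        batch = [requests.get()]         # block until at least one request
        while not requests.empty():      # then drain whatever has piled up
            batch.append(requests.get_nowait())
        rows = np.vstack([row for row, _ in batch])
        preds = model.predict(rows)      # one vectorized call for the batch
        for (_, holder), pred in zip(batch, preds):
            holder["pred"] = pred
            holder["done"].set()

threading.Thread(target=worker, daemon=True).start()

def predict_one(x):
    """Synchronous single-request API backed by the batching worker."""
    holder = {"done": threading.Event()}
    requests.put((np.asarray(x).reshape(1, -1), holder))
    holder["done"].wait()
    return holder["pred"]
```

Under concurrent load the worker naturally serves several requests per `predict` call; under low load it degrades to one-by-one, so it never does worse than the naive approach.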

An alternative solution is to use scikit-learn for training/validation only, and develop a separate project that loads the model file and optimizes one-by-one prediction.

The problem is that the prediction project would need to know the details of every kind of model (we may use Random Forests, LR, etc.).

So my question is: are there any solutions to reduce the one-by-one prediction overhead for sklearn?

scikit-learn version: 0.20.0 (you can propose any other version that solves this problem)

Asked Oct 29 '22 by twds

1 Answer

Yes, sklearn is optimized for vector operations, and I am not aware of code in it specifically optimized for the online (single-sample) setting. A good first step would be to profile the performance of a single-request prediction to see where the time goes. Some approaches, like Random Forests, have already been rewritten in Cython for speed, but because Python itself can be slow you may need to rewrite the high-overhead parts in C. For approaches like GBDT, consider using an optimized package (e.g., xgboost). Also check out these slides on accelerating Random Forests in scikit-learn: https://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn
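As an illustration of what a model-specific fast path can buy (this is an assumption about your setup, not an sklearn API): for a linear model such as LogisticRegression, a single prediction is just a dot product, so you can cache `coef_`, `intercept_`, and `classes_` once at startup and skip `predict`'s per-call input validation entirely:

```python
# Fast single-row path for a binary LogisticRegression: apply the cached
# learned parameters with plain NumPy instead of calling model.predict().
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1_000, 8))
y = (X @ rng.random(8) > 2.0).astype(int)  # synthetic binary labels

model = LogisticRegression().fit(X, y)

# Cache the learned parameters once at startup.
w = model.coef_.ravel()
b = model.intercept_[0]
classes = model.classes_

def predict_one(x):
    # Binary decision: the sign of the linear score, with no input validation.
    return classes[int(x @ w + b > 0)]
```

This matches `model.predict` exactly for the binary case, but as the question notes, each model family (trees, ensembles, ...) needs its own such fast path, which is the maintenance cost of this approach.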

Answered Nov 15 '22 by jagdish