As has been noted in many places, for input data with 10,000 rows it is much faster to predict the whole dataset in one batch call than to predict each row one by one (in both cases, model.n_jobs=1).
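A rough sketch of the comparison I mean (the model and data here are just placeholders; absolute timings will vary by machine, model, and scikit-learn version):

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data and model, only to illustrate the timing gap.
X = np.random.rand(10000, 20)
y = np.random.randint(0, 2, size=10000)
model = RandomForestClassifier(n_estimators=100, n_jobs=1).fit(X, y)

# Predict the whole dataset in a single batch call.
t0 = time.time()
model.predict(X)
print("batch:", time.time() - t0)

# Predict the same rows one by one, as an online service would receive them.
t0 = time.time()
for row in X:
    model.predict(row.reshape(1, -1))
print("one by one:", time.time() - t0)
```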
I know there is a lot of overhead in the one-by-one approach, but in an online service requests arrive one at a time, and it is hard to aggregate them first and then predict in batch.
An alternative is to use scikit-learn for training/validation only and build a separate project that loads the model file and optimizes one-by-one prediction. The problem is that this prediction project would need to know the details of every kind of model we might use (Random Forests, LR, etc.).
So my question is: are there any solutions to reduce the one-by-one prediction overhead in scikit-learn?
scikit-learn version: 0.20.0 (solutions that require other versions are also welcome).
Yes, scikit-learn is optimized for vectorized operations, and I am not aware of code in it specifically optimized for the online (single-request) setting. A good first step is to profile the performance of a single-request prediction to see where the time actually goes. Some estimators, such as Random Forests, have already been rewritten in Cython for speed, but because Python itself can be slow you may need to rewrite the parts with the largest overhead in C. For methods like GBDT, consider using an optimized package (e.g., xgboost). Also see these slides on accelerating Random Forests in scikit-learn: https://www.slideshare.net/glouppe/accelerating-random-forests-in-scikitlearn
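As a minimal sketch of the profiling step (the model and single row here are placeholders, not your actual service code), you can run cProfile on one predict() call:

```python
import cProfile
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder model and data; substitute your own trained estimator.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
model = RandomForestClassifier(n_estimators=100, n_jobs=1).fit(X, y)

one_row = X[0].reshape(1, -1)

# Profile a single-row prediction, sorted by cumulative time.
cProfile.run("model.predict(one_row)", sort="cumulative")
```

Typically a large share of the single-row time is Python-level overhead (input validation, argument checking) rather than the Cython tree traversal itself, which tells you which parts are worth moving to C or to an optimized package.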