
Trained Machine Learning model is too big

We have trained an Extra Trees model for a regression task. Our model consists of 3 extra-trees ensembles, each having 200 trees of depth 30. On top of the 3 ensembles, we use a ridge regression.

We trained our model for several hours and pickled the trained model (the entire class object) for later use. However, the saved model is too big: about 140 GB!

Is there a way to reduce the size of the saved model? Is there any configuration in pickle that could help, or any alternative to pickle?

Itack asked Apr 24 '17 15:04



1 Answer

You can try joblib with its compress parameter.

# sklearn.externals.joblib was removed from newer scikit-learn releases;
# import the standalone joblib package instead.
import joblib
joblib.dump(your_algo, 'pickle_file_name.pkl', compress=3)

compress - an integer from 0 to 9. A higher value means more compression, but also slower reads and writes. A value of 3 is often a good compromise.
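To get a feel for the trade-off between levels, here is a minimal sketch using only the standard library. It assumes the integer levels behave like zlib levels (zlib is joblib's default compressor), and the `payload` below is a hypothetical stand-in for a pickled model, not real tree data, so the ratios you see on an actual model will differ.

```python
import pickle
import zlib

# Hypothetical stand-in for a trained model: a large, repetitive structure.
payload = pickle.dumps(
    [[i % 50, float(i % 7)] for i in range(20000)],
    protocol=pickle.HIGHEST_PROTOCOL,
)

# Compress the same pickled bytes at three levels and record the sizes.
sizes = {level: len(zlib.compress(payload, level)) for level in (0, 3, 9)}

for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(payload):.1%} of raw)")

# Compression is lossless: decompressing gives back the original bytes.
roundtrip = zlib.decompress(zlib.compress(payload, 3))
```

Level 0 stores the data without compressing it, so it serves as the baseline here.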

joblib also supports the zlib, gzip, bz2, lzma and xz compression formats. To pick one, just give the filename the matching extension (.z, .gz, .bz2, .lzma or .xz).

Example:

joblib.dump(obj, 'your_filename.pkl.z')   # zlib
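If you would rather stay with plain pickle, the same idea can be sketched with the standard library by writing the pickle stream through a compressing file object. The helper names `dump_compressed` and `load_compressed` and the `model_stand_in` object are made up for illustration. Files written this way are ordinary compressed pickles and are not meant to be interchangeable with files written by joblib.

```python
import gzip
import lzma
import os
import pickle
import tempfile


def dump_compressed(obj, path, opener=gzip.open):
    """Pickle obj into a compressed file (hypothetical helper)."""
    with opener(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path, opener=gzip.open):
    """Load an object written by dump_compressed (hypothetical helper)."""
    with opener(path, "rb") as fh:
        return pickle.load(fh)


# Hypothetical stand-in for a trained model.
model_stand_in = {"weights": [0.5] * 10000, "depth": 30}

tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "model.pkl.gz")
xz_path = os.path.join(tmpdir, "model.pkl.xz")

dump_compressed(model_stand_in, gz_path)                    # gzip
dump_compressed(model_stand_in, xz_path, opener=lzma.open)  # xz / lzma

restored = load_compressed(xz_path, opener=lzma.open)
raw_size = len(pickle.dumps(model_stand_in, protocol=pickle.HIGHEST_PROTOCOL))
print(raw_size, os.path.getsize(gz_path), os.path.getsize(xz_path))
```

You have to load with the same opener you used to write the file, since plain pickle does not detect the compression format for you.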

For more information, see the link.

Rajish sani answered Oct 17 '22 05:10