I want to store the TF-IDF matrix so I don't have to recalculate it every time. I am using scikit-learn's TfidfVectorizer. Is it more efficient to pickle it or to store it in a database?
Some context: I am using k-means clustering to provide document recommendations. Since new documents are added frequently, I would like to store the TF-IDF values of the documents so that I can recalculate the clusters.
Pickling (especially using joblib.dump) is good for short-term storage, e.g. to save partial results in an interactive session or to ship a model from a development server to a production server.
However, the pickle format depends on the class definitions of the models, and those might change from one version of scikit-learn to another.
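For the short-term case, a minimal sketch of dumping and reloading with joblib could look like this. The sample documents, the dictionary keys, and the temporary file path are made up for illustration; only joblib.dump, joblib.load, and TfidfVectorizer come from the libraries themselves.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # scipy.sparse matrix, one row per document

# Persist the fitted vectorizer together with the matrix it produced.
path = os.path.join(tempfile.mkdtemp(), "tfidf.joblib")
joblib.dump({"vectorizer": vectorizer, "matrix": tfidf}, path)

# Later (same scikit-learn version assumed): reload both objects.
restored = joblib.load(path)
restored_vectorizer = restored["vectorizer"]
restored_tfidf = restored["matrix"]

# The restored vectorizer maps new text into the same feature space.
new_row = restored_vectorizer.transform(["a new document about the cat"])
```

Note that transform reuses the vocabulary and IDF weights learned at fit time; it does not update them for the new document.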
I would recommend writing your own implementation-independent persistence model if you plan to keep the model for a long time and want to be able to load it in future versions of scikit-learn.
I would also recommend using the HDF5 file format (used, for instance, by PyTables) or another database system that has some kind of support for storing numerical arrays efficiently.
Also have a look at the internal CSR and COO data structures that scipy.sparse uses to represent sparse matrices, to come up with an efficient way to store those in a database.
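To make that concrete, here is a rough sketch of what such an implementation-independent representation might look like. A CSR matrix is fully described by three plain arrays (data, indices, indptr) plus its shape, and a COO matrix by (row, column, value) triples. The helper names save_csr, load_csr, and to_triples are hypothetical, and a NumPy .npz archive stands in for whatever array store or database you choose.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def save_csr(path, matrix):
    """Write the raw CSR arrays and shape to a NumPy .npz archive."""
    csr = sparse.csr_matrix(matrix)
    np.savez(path, data=csr.data, indices=csr.indices,
             indptr=csr.indptr, shape=np.array(csr.shape))


def load_csr(path):
    """Rebuild a CSR matrix from the arrays written by save_csr."""
    with np.load(path) as f:
        shape = tuple(int(n) for n in f["shape"])
        return sparse.csr_matrix((f["data"], f["indices"], f["indptr"]),
                                 shape=shape)


def to_triples(matrix):
    """Return (row, column, value) triples, e.g. as rows for a database table."""
    coo = sparse.coo_matrix(matrix)
    return [(int(r), int(c), float(v))
            for r, c, v in zip(coo.row, coo.col, coo.data)]


# Stand-in for a TF-IDF matrix: a small random sparse matrix.
original = sparse.random(5, 8, density=0.3, format="csr", random_state=0)

path = os.path.join(tempfile.mkdtemp(), "tfidf_matrix.npz")
save_csr(path, original)
loaded = load_csr(path)
triples = to_triples(original)
```

Because only plain arrays and numbers are stored, nothing here depends on scikit-learn's class definitions. scipy.sparse.save_npz and scipy.sparse.load_npz offer a similar ready-made archive if you do not need control over the layout.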