I have a question related to gensim. I would like to know whether it is recommended or necessary to use pickle when saving or loading a model (or multiple models), as I find scripts on GitHub that do either.
from gensim.models.doc2vec import Doc2Vec
mymodel = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
mymodel.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)
Variant 1:
import pickle
# Save
mymodel.save("mymodel.pkl") # Stores *.pkl file
# Load (pickle.load expects an open binary file object, not a path)
with open("mymodel.pkl", "rb") as f:
    mymodel = pickle.load(f)
Variant 2:
# Save
mymodel.save("mymodel.model") # Stores *.model file
# Load
mymodel = Doc2Vec.load("mymodel.model")
In gensim.utils, it appears to me that there is a pickle function embedded: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py
def save ... try: _pickle.dump(self, fname_or_handle, protocol=pickle_protocol) ...
Goal of my question: I would be glad to learn 1) whether I need pickle (for better memory management) and 2) if so, why it is better than loading *.model files.
Thank you!
Whenever you store a model using the built-in gensim function save(), pickle is used regardless of the file extension. The documentation for utils tells us this:
class gensim.utils.SaveLoad
Bases: object
Class which inherit from this class have save/load functions, which un/pickle them to disk.
Warning: This uses pickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions etc.
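The warning in that passage can be reproduced with the standard library alone. The sketch below is not gensim code: a plain dict stands in for a model's attributes, and the names `picklable_state`, `unpicklable_state` and `can_pickle` are made up for illustration.

```python
import pickle

# Hypothetical stand-ins for a model's attributes (not real gensim objects).
picklable_state = {"vector_size": 100, "window": 8}
unpicklable_state = {"vector_size": 100, "tokenizer": lambda text: text.split()}


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type for lambdas varies between Python versions.
        return False


print(can_pickle(picklable_state))    # True
print(can_pickle(unpicklable_state))  # False: the lambda cannot be pickled
```

If a model carries such an attribute, one option is to replace the lambda with a named module-level function before saving.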
So gensim will use pickle to save any model as long as the model class inherits from the gensim.utils.SaveLoad class. In your case gensim.models.doc2vec.Doc2Vec inherits from gensim.models.base_any2vec.BaseWordEmbeddingsModel, which in turn inherits from gensim.utils.SaveLoad, which provides the actual save() function.
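Since the pickling happens inside save(), the file extension is only a label. A minimal standard-library sketch of that point (this is plain pickle, not gensim's save(); the `state` dict and the `mymodel` file names are assumptions for illustration):

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model.
state = {"vector_size": 100, "min_count": 5}

loaded = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for extension in (".model", ".pkl", ".p", ".pickle"):
        path = os.path.join(tmp_dir, "mymodel" + extension)
        # Write the same pickled payload under a different extension each time.
        with open(path, "wb") as handle:
            pickle.dump(state, handle)
        # Read it back; pickle never looks at the file name.
        with open(path, "rb") as handle:
            loaded[extension] = pickle.load(handle)

# Every copy round-trips to the same content.
print(all(restored == state for restored in loaded.values()))  # True
```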
To answer your questions:
1) With gensim's save() function you can choose any file extension: *.model, *.pkl, *.p, *.pickle. The saved file will be pickled either way.
2) It depends on what your requirements are.
If you are going to use the data with Python and you don't need to switch between Python versions (I experienced some problems porting pickled models from Python 2 to Python 3), a binary format is a good choice.
If you want interoperability, or the model may be used in other projects or by other programmers, I would use gensim's save method.
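One factor when moving pickled files between Python versions is the pickle protocol; the save() excerpt in the question passes a pickle_protocol argument through to pickle. Protocol 2 is the highest protocol that Python 2 can read. The sketch below uses plain pickle with a made-up `state` dict, and choosing a protocol does not by itself guarantee that a model ports cleanly between versions.

```python
import pickle

# Hypothetical stand-in for a model's attributes.
state = {"vector_size": 100, "window": 8}

# Ask for protocol 2 explicitly instead of the interpreter's default.
payload = pickle.dumps(state, protocol=2)

# Pickles written with protocol 2 or later start with the PROTO opcode (0x80)
# followed by the protocol number, so the header shows what was used.
print(payload[:2])  # b'\x80\x02'

restored = pickle.loads(payload)
print(restored == state)  # True
```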