gensim: pickle or not?

I have a question related to gensim. I would like to know whether it is recommended or necessary to use pickle when saving or loading a model (or multiple models), as I find scripts on GitHub that do either.

mymodel = Doc2Vec(documents, size=100, window=8, min_count=5, workers=4)
mymodel.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

See here

Variant 1:

import pickle
# Save
mymodel.save("mymodel.pkl")  # Stores *.pkl file
# Load (pickle.load expects an open binary file object, not a path)
with open("mymodel.pkl", "rb") as f:
    mymodel = pickle.load(f)

Variant 2:

from gensim.models.doc2vec import Doc2Vec
# Save
mymodel.save("mymodel.model")  # Stores *.model file
# Load
mymodel = Doc2Vec.load("mymodel.model")

In gensim.utils, it appears to me that the save function already calls pickle internally: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/utils.py

def save ... try: _pickle.dump(self, fname_or_handle, protocol=pickle_protocol) ...

Goal of my question: I would be glad to learn 1) whether I need pickle (for better memory management) and 2) if so, why it is better than loading *.model files.

Thank you!

Christopher asked Jun 02 '18 09:06


2 Answers

Whenever you store a model using the built-in gensim function save(), pickle is being used regardless of the file extension. The documentation for utils tells us this:

class gensim.utils.SaveLoad

Bases: object

Class which inherit from this class have save/load functions, which un/pickle them to disk.

Warning

This uses pickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions etc.

So gensim will use pickle to save any model as long as the model class inherits from the gensim.utils.SaveLoad class. In your case, gensim.models.doc2vec.Doc2Vec inherits from gensim.models.base_any2vec.BaseWordEmbeddingsModel, which in turn inherits from gensim.utils.SaveLoad, which provides the actual save() function.
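
The inheritance mechanism described above can be illustrated with a minimal sketch that uses only the standard library. SaveLoadSketch and TinyModel are hypothetical names made up for this illustration; this is not gensim's actual implementation, which does more work around large arrays.

```python
import os
import pickle
import tempfile


class SaveLoadSketch:
    """Hypothetical stand-in for a SaveLoad-style base class (not gensim's code)."""

    def save(self, fname):
        # Serialize the whole object with pickle, whatever the file name is.
        with open(fname, "wb") as handle:
            pickle.dump(self, handle, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname):
        # Unpickle the object back from disk.
        with open(fname, "rb") as handle:
            return pickle.load(handle)


class TinyModel(SaveLoadSketch):
    """Hypothetical model class that gets save()/load() through inheritance."""

    def __init__(self, vector_size):
        self.vector_size = vector_size


def roundtrip(vector_size):
    # Save a model to a temporary file and load it back.
    path = os.path.join(tempfile.mkdtemp(), "tiny.model")
    TinyModel(vector_size).save(path)
    return TinyModel.load(path)
```

TinyModel never mentions pickle itself; it only inherits save() and load(), which is the same relationship Doc2Vec has to SaveLoad.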

To answer your questions:

  1. Yes, pickle is needed unless you want to write your own function for storing your models to disk, but you don't have to call it yourself: save() does that for you. Since pickle is part of the standard library, this is not a problem; you won't even notice it.
  2. If you use the gensim save() function, you can choose any file extension: *.model, *.pkl, *.p, *.pickle. The saved file will be pickled either way.
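
That the file extension plays no role can be checked with plain pickle. This is a minimal sketch using only the standard library; the state dictionary is a hypothetical stand-in for a model, and no gensim code is involved.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for model state; a real gensim model is far larger.
state = {"vector_size": 100, "window": 8}


def save_and_load(extension):
    # Write the same object under the given extension and read it back.
    path = os.path.join(tempfile.mkdtemp(), "mymodel" + extension)
    with open(path, "wb") as handle:
        pickle.dump(state, handle)
    with open(path, "rb") as handle:
        return pickle.load(handle)


# The extension is only part of the file name; the content is a pickle each time.
results = [save_and_load(ext) for ext in (".model", ".pkl", ".p", ".pickle")]
```

Each entry in results compares equal to state, whichever extension was used.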

WolfgangK answered Sep 23 '22 06:09


It depends on your requirements.

If you are only going to use the data with Python and you don't need to switch between Python versions (I ran into some problems porting pickled models from Python 2 to Python 3), a binary format will be a good choice.
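
One common source of such porting problems is the pickle protocol version. The following is a minimal sketch using only the standard library, with a hypothetical state dictionary in place of a model. Whether the protocol was the cause of the problems mentioned here is an assumption.

```python
import pickle

# Hypothetical stand-in for model state.
state = {"vector_size": 100, "tags": ["doc_0", "doc_1"]}

# Protocol 2 is the newest protocol that Python 2 can read.
portable_bytes = pickle.dumps(state, protocol=2)

# Protocols 3 and above can only be read by Python 3.
modern_bytes = pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL)

# Python 3 can read both.
restored_portable = pickle.loads(portable_bytes)
restored_modern = pickle.loads(modern_bytes)
```

In the other direction, reading a file pickled by Python 2 from Python 3 may need pickle.load(f, encoding="latin1"), in particular when the object contains NumPy arrays.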

If you want interoperability, or if the model could be used in other projects or by other programmers, I would use gensim's save method.

l.augustyniak answered Sep 24 '22 06:09