
How do you save a model, dictionary and corpus to disk in Gensim, and then load them again?

Tags:

python

nlp

gensim

In Gensim's documentation, it says:

You can save trained models to disk and later load them back, either to continue training on new training documents or to transform new documents.

I would like to do this with a dictionary, a corpus and a TF-IDF model. However, the documentation only says that this is possible; it doesn't explain how to save these objects and load them back again.

How do you do this?


I've been using pickle, but I don't know if this is the right approach:

import pickle
with open("tfidf.p", "wb") as f:  # tfidf is the already-trained model
    pickle.dump(tfidf, f)
with open("tfidf.p", "rb") as f:
    tfidf_reloaded = pickle.load(f)
Data asked Nov 20 '19 19:11


2 Answers

In general, you can save things with generic Python pickle, but most gensim models support their own native .save() method.

It takes a target filesystem path and saves the model more efficiently than pickle, often by placing large component arrays in separate files alongside the main file. (If you later move the saved model, keep all the files that share its root name together.)

In particular, some models with multi-gigabyte subcomponents may not save at all with pickle, but gensim's native .save() will work.

Models saved with .save() can typically be loaded with the matching class's .load() method. (For example, if you've saved an instance of gensim.corpora.dictionary.Dictionary, you'd load it with gensim.corpora.dictionary.Dictionary.load(filepath).)

gojomo answered Sep 28 '22 08:09


Saving the Dict and Corpus to disk

from gensim import corpora
# DICT_PATH and CORPUS_PATH are file paths of your choosing
dictionary.save(DICT_PATH)
corpora.MmCorpus.serialize(CORPUS_PATH, corpus)

Loading the Dict and Corpus from disk

loaded_dict = corpora.Dictionary.load(DICT_PATH)
loaded_corp = corpora.MmCorpus(CORPUS_PATH)
BHA Bilel answered Sep 28 '22 07:09