
How to save a Python list of strings for future use

Tags: python, numpy

I just finished text pre-processing (stop-word removal, tokenization, etc.) of 43K documents in Python, and the result is a list of processed text documents (strings). Next I want to convert these processed strings into bag-of-words feature vectors.

I need help on two things.

1) Pre-processing those 43K documents took 45 minutes on my system. I don't want to repeat that work if I restart my system later. How do I save the list of pre-processed strings? Should I simply write it to a txt file, or should I use pickle or JSON? Which is preferable in terms of loading quickly into memory without issues? I want to do the same for a bag-of-words matrix (NumPy matrix).

2) I am going to run LDA or k-means clustering on this bag-of-words matrix later. What is the best way to persist my model so that I don't have to fit it again? Pickling?

If pickling is the solution, can someone suggest the right syntax for saving and reading back in both cases?

Baktaawar asked Feb 08 '23 21:02



1 Answer

I use joblib, the library scikit-learn relies on for persistence. In my test it was faster than the other answer, which uses cPickle and gzip (170 ms vs. 430 ms). And the code is simple and cool. :)

Use joblib.dump to save and joblib.load to read:

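# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib

# clf is the fitted model (e.g. LDA or k-means)
joblib.dump(clf, 'filename.pkl')

# later, e.g. after a restart
clf = joblib.load('filename.pkl')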
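
The question also asks about the bag-of-words matrix. A minimal sketch, assuming the matrix is a dense NumPy array: NumPy's own np.save and np.load write and read it in the binary .npy format. The tiny matrix and the temporary file location below are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the real document-term count matrix
# (rows = documents, columns = vocabulary terms).
bow = np.array([[1, 0, 2],
                [0, 3, 1]], dtype=np.int64)

# Hypothetical location; any writable path ending in .npy works.
path = os.path.join(tempfile.mkdtemp(), "bow.npy")

np.save(path, bow)         # write the array in binary .npy format
bow_loaded = np.load(path) # read it back, e.g. after a restart
```

If the matrix is stored as a SciPy sparse matrix instead, this sketch does not apply as written; it covers dense arrays only.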

For more detail on model persistence with joblib, see: http://scikit-learn.org/stable/modules/model_persistence.html
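
For the list of pre-processed strings itself, the standard library may already be enough. The following is a rough sketch, not a benchmark: it saves and reloads a list of strings with both pickle and json. The sample documents, file names, and temporary folder are made up for illustration.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the 43K pre-processed documents.
docs = ["quick brown fox", "lazy dog sleeps", "naive bayes text"]

# Hypothetical output folder; use any writable directory.
out_dir = Path(tempfile.mkdtemp())
pkl_path = out_dir / "docs.pkl"
json_path = out_dir / "docs.json"

# pickle: binary and Python-specific. Only unpickle files you trust.
with open(pkl_path, "wb") as f:
    pickle.dump(docs, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pkl_path, "rb") as f:
    docs_from_pickle = pickle.load(f)

# json: plain text, readable from other languages and tools.
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(docs, f, ensure_ascii=False)
with open(json_path, encoding="utf-8") as f:
    docs_from_json = json.load(f)
```

Both round trips return an equal list, so the choice is mostly about portability (json) versus handling arbitrary Python objects (pickle). Which one loads faster on a given machine is worth timing on the real data.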

hrwhisper answered Feb 13 '23 22:02
