I just did text pre-processing of 43K documents (stop-word removal, tokenization, etc.) in Python, and the result is a list of processed text documents (strings). Now I want to convert these processed strings into bag-of-words feature vectors.
I need help with two things.
1) It took 45 minutes on my system to pre-process those 43K documents, and I don't want to redo that work every time I restart my system. How do I save the list of pre-processed strings? Should I simply write it to a txt file, or should I use pickle or json? Which is preferable in terms of faster reading into memory and fewer issues? I want to do the same for the bag-of-words matrix (a NumPy matrix).
2) I am going to run LDA or k-means clustering on this bag-of-words matrix later. What is the best way to persist the fitted model so that I don't have to re-run it? Pickling?
If pickling is the solution, can someone suggest the right syntax for pickling in both these cases and reading the data back in?
I use sklearn's joblib; it is faster than the other answer, which uses cPickle and gzip (170 ms vs. 430 ms in my test). And the code is simple and cool. :)
Use joblib.dump to save and joblib.load to read:
from sklearn.externals import joblib  # in newer scikit-learn versions, use `import joblib` instead

# save the fitted estimator to disk
joblib.dump(clf, 'filename.pkl')

# load it back into memory
clf = joblib.load('filename.pkl')
See more detail about it here: http://scikit-learn.org/stable/modules/model_persistence.html
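For the question's two cases specifically, here is a minimal sketch (the names docs, bow_matrix, and model are hypothetical placeholders, not from the question; it assumes the standalone joblib package, which older scikit-learn versions exposed as sklearn.externals.joblib):

import pickle
import numpy as np
import joblib  # older scikit-learn: from sklearn.externals import joblib
from sklearn.cluster import KMeans

# Toy stand-ins for the question's data: the 43K pre-processed strings
# and the bag-of-words matrix.
docs = ["first processed document", "second processed document"]
bow_matrix = np.random.rand(10, 5)

# 1) The list of pre-processed strings: pickle round-trips arbitrary
#    Python objects in binary form, so nothing needs re-parsing on load.
with open('docs.pkl', 'wb') as f:
    pickle.dump(docs, f)
with open('docs.pkl', 'rb') as f:
    docs = pickle.load(f)

# 2) The NumPy matrix: np.save/np.load write the raw array to a .npy file.
np.save('bow_matrix.npy', bow_matrix)
bow_matrix = np.load('bow_matrix.npy')

# 3) The fitted model (k-means here; LDA works the same way): joblib, as above.
model = KMeans(n_clusters=2, n_init=10).fit(bow_matrix)
joblib.dump(model, 'model.pkl')
model = joblib.load('model.pkl')

As a rule of thumb, np.save is the compact binary choice for a dense NumPy array, while pickle or joblib handle arbitrary Python objects such as lists of strings and fitted estimators.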