I just did text pre-processing of 43K documents (stop-word removal, tokenization, etc.) in Python, and the result is a list of processed text documents (strings). Now I want to convert these processed strings into bag-of-words feature vectors.
I need help with two things.
1) It took 45 minutes on my system to pre-process those 43K documents, and I don't want to redo that work every time I restart my system. How do I save the list of pre-processed strings? Should I simply write it to a txt file, or should I use pickle or json? Which is preferable in terms of faster reading into memory and fewer issues? I want to do the same for the bag-of-words matrix (a NumPy matrix).
2) I am going to run LDA or k-means clustering on this bag-of-words matrix later. What is the best way to persist the fitted model so that I don't have to re-run it? Pickling?
If pickling is the solution, can someone suggest the right syntax for pickling in both these cases and reading the data back in?
I use sklearn's joblib; it is faster than the other answer, which uses cPickle and gzip (170 ms vs. 430 ms in my test). And the code is simple and cool. :)
Use joblib.dump to save and joblib.load to read:
from sklearn.externals import joblib  # in newer scikit-learn versions, use `import joblib` instead

# save the fitted estimator to disk
joblib.dump(clf, 'filename.pkl')

# load it back into memory
clf = joblib.load('filename.pkl')
See more detail about it here: http://scikit-learn.org/stable/modules/model_persistence.html
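For the question's two cases specifically, here is a minimal sketch (the names docs, bow_matrix, and model are hypothetical placeholders, not from the question; it assumes the standalone joblib package, which older scikit-learn versions exposed as sklearn.externals.joblib):

import pickle
import numpy as np
import joblib  # older scikit-learn: from sklearn.externals import joblib
from sklearn.cluster import KMeans

# Toy stand-ins for the question's data: the 43K pre-processed strings
# and the bag-of-words matrix.
docs = ["first processed document", "second processed document"]
bow_matrix = np.random.rand(10, 5)

# 1) The list of pre-processed strings: pickle round-trips arbitrary
#    Python objects in binary form, so nothing needs re-parsing on load.
with open('docs.pkl', 'wb') as f:
    pickle.dump(docs, f)
with open('docs.pkl', 'rb') as f:
    docs = pickle.load(f)

# 2) The NumPy matrix: np.save/np.load write the raw array to a .npy file.
np.save('bow_matrix.npy', bow_matrix)
bow_matrix = np.load('bow_matrix.npy')

# 3) The fitted model (k-means here; LDA works the same way): joblib, as above.
model = KMeans(n_clusters=2, n_init=10).fit(bow_matrix)
joblib.dump(model, 'model.pkl')
model = joblib.load('model.pkl')

As a rule of thumb, np.save is the compact binary choice for a dense NumPy array, while pickle or joblib handle arbitrary Python objects such as lists of strings and fitted estimators.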