I am doing a machine learning project in Python, and I am using the pickle module for it.
I am parsing a huge data set that cannot be processed in one execution, so I need to save the classifier object and update it in the next execution.
So my question is: when I run the program again with a new data set, will the already created pickle object be modified (or updated)? If not, how can I update the same pickle object every time I run the program?
save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier,save_classifier)
save_classifier.close()
Unpickling your classifier object will re-create it in the same state it was in when you pickled it, so you can proceed to update it with fresh data from your data set. At the end of the program run, you pickle the classifier again and save it to a file. It's a Good Idea not to overwrite the same file but to keep a backup (or even better, a series of backups), in case you mess something up. That way, you can easily go back to a known good state of your classifier.
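To illustrate the "series of backups" idea, here is a minimal sketch. The helper name make_numbered_backup and the path.N.bak naming scheme are assumptions made for this example, not anything pickle itself prescribes.

```python
import os
import shutil


def make_numbered_backup(path):
    """Copy `path` to the first unused name of the form path.N.bak."""
    n = 1
    while os.path.exists("{}.{}.bak".format(path, n)):
        n += 1
    backup = "{}.{}.bak".format(path, n)
    # Copy rather than rename, so the current pickle file stays in place
    shutil.copy2(path, backup)
    return backup
```

Calling make_numbered_backup("naivebayes.pickle") just before each save would leave naivebayes.pickle.1.bak, naivebayes.pickle.2.bak, and so on, so any earlier state can be restored by copying that file back.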
You should experiment with pickling, using a simple program and a simple object to pickle and unpickle, until you're totally confident with how this all works.
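As one possible experiment of that kind, the sketch below pickles a collections.Counter (a stand-in chosen here for simplicity, not the classifier itself) and shows that changing the unpickled object does not change the pickled data until you dump it again.

```python
import pickle
from collections import Counter

# A simple object standing in for the classifier
counts = Counter(["pos", "neg", "pos"])

# Pickle to bytes in memory (pickle.dump to a file behaves the same way)
data = pickle.dumps(counts)

# Unpickling re-creates an independent object in the pickled state
restored = pickle.loads(data)
restored.update(["neg"])

# The pickled bytes are a snapshot: loading them again gives the old state
again = pickle.loads(data)
print(restored)
print(again)
```

The second print shows the original counts, which is why the updated object has to be pickled again at the end of each run.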
Here's a rough sketch of how to update the pickled classifier data.
import pickle
import os
from os.path import exists
import nltk
# other imports required for nltk ...

picklename = "naivebayes.pickle"

# stuff to set up featuresets ...
featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

# Load or create a classifier and apply training set to it
if exists(picklename):
    # Update existing classifier
    with open(picklename, "rb") as f:
        classifier = pickle.load(f)
    # Caution: in NLTK, NaiveBayesClassifier.train is a class method that
    # returns a new classifier instead of updating this one in place, so
    # check that your classifier really supports incremental training.
    classifier.train(training_set)
else:
    # Create a brand new classifier
    classifier = nltk.NaiveBayesClassifier.train(training_set)

# Create backup
if exists(picklename):
    backupname = picklename + '.bak'
    if exists(backupname):
        os.remove(backupname)
    os.rename(picklename, backupname)

# Save
with open(picklename, "wb") as f:
    pickle.dump(classifier, f)
The first time you run this program it will create a new classifier, train it with the data in training_set, then pickle classifier to "naivebayes.pickle". Each subsequent time you run this program it will load the old classifier and apply more training data to it (this relies on the classifier's train method updating the loaded object rather than returning a new one).
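The load, update, save cycle can also be tried on plain data. The sketch below is only an illustration under stated assumptions: the helper names load_or_default, save_atomically and accumulate are hypothetical, a list stands in for the state being carried between runs, and os.replace requires Python 3.3 or later.

```python
import os
import pickle


def load_or_default(path, default):
    """Return the unpickled object at `path`, or `default` if no file exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


def save_atomically(obj, path):
    """Pickle to a temporary file first, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(obj, f)
    # os.replace overwrites the target in one step (Python 3.3+)
    os.replace(tmp, path)


def accumulate(path, new_items):
    """Load the saved list (if any), extend it with new_items, and save it."""
    items = load_or_default(path, [])
    items.extend(new_items)
    save_atomically(items, path)
    return items
```

If the classifier you use cannot absorb more data after it has been trained, one option along these lines is to accumulate the featuresets themselves across runs and pass the whole list to nltk.NaiveBayesClassifier.train each time.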
BTW, if you are doing this in Python 2 you should use the much faster cPickle module; you can do that by replacing
import pickle
with
import cPickle as pickle
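If the same script may run under either Python 2 or Python 3, a common pattern (shown here as a sketch, not something the code above requires) is to try cPickle first and fall back to pickle.

```python
try:
    import cPickle as pickle  # Python 2: the faster C implementation
except ImportError:
    import pickle  # Python 3: no cPickle module; pickle uses its C accelerator when available

# Protocol 2 is understood by both Python 2 and Python 3
data = pickle.dumps({"a": 1}, protocol=2)
restored = pickle.loads(data)
```

The rest of the program can then refer to pickle without caring which module was actually imported.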