
Updating Python Pickle Object

Tags: python, pickle

I am working on a machine learning project and am using Python's pickle module for it.

The data set is too large to parse in a single execution, so I need to save the classifier object and update it in the next execution.

So my question is: when I run the program again with a new data set, will the already created pickle file be modified (or updated)? If not, how can I update the same pickled object every time I run the program?

with open("naivebayes.pickle", "wb") as save_classifier:
    pickle.dump(classifier, save_classifier)
asked Oct 18 '22 11:10 by arqam


1 Answer

The pickle file is not updated on its own; it only changes when you write to it again. Unpickling your classifier object will re-create it in the same state it was in when you pickled it, so you can proceed to update it with fresh data from your data set. At the end of the program run, you pickle the classifier again and save it to a file again. It's a Good Idea to not overwrite the same file but to keep a backup (or even better, a series of backups), in case you mess something up. That way, you can easily go back to a known good state of your classifier.
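One way to keep a series of backups is to copy the current pickle file to a timestamped name before overwriting it. This is only a sketch: the helper name backup_with_timestamp and the file-name pattern are made up for illustration, and two backups taken within the same second would share a name.

```python
import os
import shutil
import tempfile
import time


def backup_with_timestamp(path):
    """Copy path to path.<YYYYmmdd-HHMMSS>.bak and return the backup name.

    Returns None if there is no file to back up yet.
    (Hypothetical helper, not part of pickle or nltk.)
    """
    if not os.path.exists(path):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = "{}.{}.bak".format(path, stamp)
    shutil.copy2(path, backup)  # copy, so the original stays in place
    return backup


# Small demonstration in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "naivebayes.pickle")

first = backup_with_timestamp(demo_file)   # nothing to back up yet
with open(demo_file, "wb") as f:
    f.write(b"pretend this is a pickled classifier")
second = backup_with_timestamp(demo_file)  # now a backup is created
```

Because the original file is copied rather than renamed, the program can still fall back to it if the later save fails.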

You should experiment with pickling, using a simple program and a simple object to pickle and unpickle, until you're totally confident with how this all works.
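A small experiment along those lines might look like this. A plain dict stands in for the classifier (an assumption to keep the example short), and each "run" is simulated inside one script using a temporary file.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the classifier object.
state = {"runs": 1, "seen": ["spam", "ham"]}

path = os.path.join(tempfile.mkdtemp(), "demo.pickle")

# First "run": save the object.
with open(path, "wb") as f:
    pickle.dump(state, f)

# Second "run": load it, modify it, and save it again.
with open(path, "rb") as f:
    restored = pickle.load(f)
restored["runs"] += 1
restored["seen"].append("eggs")
with open(path, "wb") as f:
    pickle.dump(restored, f)

# Third "run": the file now holds the updated state.
with open(path, "rb") as f:
    final = pickle.load(f)
print(final)
```

Note that the unpickled object is an independent copy: changing restored does not touch state, and the file only changes when pickle.dump is called again.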


Here's a rough sketch of how to update the pickled classifier data.

import os
import pickle
from os.path import exists

import nltk
# other imports required for nltk ...

picklename = "naivebayes.pickle"

# stuff to set up featuresets ...

featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

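# Load or create a classifier and apply training set to it
if exists(picklename):
    # Update existing classifier
    with open(picklename, "rb") as f:
        classifier = pickle.load(f)
    # NOTE: this updates the loaded object only if its train() is an
    # instance method that trains incrementally. nltk.NaiveBayesClassifier.train
    # is a class method that returns a new classifier, so with NLTK this call
    # leaves the loaded classifier unchanged.
    classifier.train(training_set)
else:
    # Create a brand new classifier
    classifier = nltk.NaiveBayesClassifier.train(training_set)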

# Create backup
if exists(picklename):
    backupname = picklename + '.bak'
    if exists(backupname):
        os.remove(backupname)
    os.rename(picklename, backupname)

# Save
with open(picklename, "wb") as f:
    pickle.dump(classifier, f)

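The first time you run this program it will create a new classifier, train it with the data in training_set, then pickle classifier to "naivebayes.pickle". Each subsequent time you run this program it will load the old classifier and call train on it with more training data. Whether that call actually updates the loaded object depends on the classifier: nltk.NaiveBayesClassifier.train is a class method that builds and returns a new classifier, so for NLTK you would need to retrain on the combined data (or accumulate your own counts) instead.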
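If the classifier has no incremental training method, one option is to pickle the running counts yourself and rebuild whatever you need from them. The following is a rough sketch of that load, update, save cycle, not NLTK's API: the function names and the layout of the counts dict are assumptions, and the input is assumed to be (features_dict, label) pairs like the featuresets above.

```python
import os
import pickle
import tempfile
from collections import Counter


def load_counts(path):
    """Return the pickled counts, or fresh empty counts on the first run."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"labels": Counter(), "features": Counter()}


def update_counts(counts, featuresets):
    """Add one batch of (features_dict, label) pairs to the running counts."""
    for features, label in featuresets:
        counts["labels"][label] += 1
        for name, value in features.items():
            counts["features"][(label, name, value)] += 1


def save_counts(counts, path):
    """Overwrite the pickle file with the current counts."""
    with open(path, "wb") as f:
        pickle.dump(counts, f)


def run_once(path, batch):
    """One program execution: load, update with the new batch, save."""
    counts = load_counts(path)
    update_counts(counts, batch)
    save_counts(counts, path)
    return counts


# Two simulated executions against the same pickle file.
demo_path = os.path.join(tempfile.mkdtemp(), "counts.pickle")
run_once(demo_path, [({"good": True}, "pos"), ({"good": False}, "neg")])
after_second = run_once(demo_path, [({"good": True}, "pos")])
print(after_second["labels"]["pos"])  # 2
```

The counts from the first batch survive into the second execution because they are read back from the file before the new batch is added.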


BTW, if you are doing this in Python 2 you should use the much faster cPickle module (in Python 3 the plain pickle module already uses its C implementation when available); you can do that by replacing

import pickle 

with

import cPickle as pickle
answered Oct 28 '22 16:10 by PM 2Ring