I am trying to load a pkl dump of my classifier from scikit-learn.
The joblib dump does a much better compression than the cPickle dump for my object so I would like to stick with it. However, I am getting an error when trying to read the object from AWS S3.
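For context, a minimal sketch of how the two dumps might have been produced (clf, the file paths, and the compress level are assumptions for illustration, not from the question):

from sklearn.externals import joblib
import cPickle

# joblib dump with compression (compress level is an assumed example)
joblib.dump(clf, 'static/res/classifier.pkl', compress=3)

# plain cPickle dump of the same trained object to a separate file
with open('static/res/classifier_cpickle.pkl', 'wb') as f:
    cPickle.dump(clf, f, protocol=cPickle.HIGHEST_PROTOCOL)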
Cases:
Note that the pkl objects for joblib and pickle are different objects dumped with their respective methods (i.e. joblib loads only a joblib.dump(obj) and pickle loads only a cPickle.dump(obj)).
Joblib vs cPickle code
# case 2, this works for joblib, object pushed to heroku
resources_dir = os.getcwd() + "/static/res/" # main resource directory
input = joblib.load(resources_dir + 'classifier.pkl')
# case 3, this does not work for joblib, object hosted on s3
aws_app_assets = "https://%s.s3.amazonaws.com/static/res/" % keys.AWS_BUCKET_NAME
classifier_url_s3 = aws_app_assets + 'classifier.pkl'
# does not work with raw url, IO Error
classifier = joblib.load(classifier_url_s3)
# urllib2, can't open instance
# TypeError: coercing to Unicode: need string or buffer, instance found
req = urllib2.Request(url=classifier_url_s3)
f = urllib2.urlopen(req)
classifier = joblib.load(urllib2.urlopen(classifier_url_s3))
# but works with a cPickle object hosted on S3
classifier = cPickle.load(urllib2.urlopen(classifier_url_s3))
My app works fine in case 2, but because of very slow loading I wanted to try to push all static files out to S3, particularly these pickle dumps. Is there something inherently different about the way joblib loads vs pickle that would cause this error?
This is my error
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 409, in load
with open(filename, 'rb') as file_handle:
IOError: [Errno 2] No such file or directory: classifier url on s3
[Finished in 0.3s with exit code 1]
It is not a permissions issue, as I've made all my objects on S3 public for testing, and the pickle.dump objects load fine. The joblib.dump object also downloads if I enter the URL directly into the browser.
I could be completely missing something.
Thanks.
joblib.load() expects the name of a file present on the filesystem.
Signature: joblib.load(filename, mmap_mode=None)
Parameters
-----------
filename: string
The name of the file from which to load the object
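If you do keep the dump publicly readable over HTTP, one workaround is to download it to a temporary file first and load from there. A minimal sketch, reusing classifier_url_s3 from the question; the temp-file handling is an assumption, not part of the original code:

import os
import tempfile
import urllib2

from sklearn.externals import joblib

# joblib.load() needs a path on disk, so write the HTTP response to a temp file first
response = urllib2.urlopen(classifier_url_s3)
with tempfile.NamedTemporaryFile(suffix='.pkl', delete=False) as tmp:
    tmp.write(response.read())
    tmp_path = tmp.name

classifier = joblib.load(tmp_path)
os.remove(tmp_path)  # clean up the temporary copy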
Moreover, making all your resources public might not be a good idea for your other assets, even if you don't mind the pickled model being accessible to the world.
It is rather simple to copy the object from S3 to the local filesystem of your worker first:
from boto.s3.connection import S3Connection
from sklearn.externals import joblib
import os

s3_connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
s3_bucket = s3_connection.get_bucket(keys.AWS_BUCKET_NAME)

local_file = '/tmp/classifier.pkl'
# pass the key name inside the bucket, not the full S3 URL
s3_bucket.get_key('static/res/classifier.pkl').get_contents_to_filename(local_file)
clf = joblib.load(local_file)
os.remove(local_file)
Hope this helped.
P.S. You can use this approach to pickle the entire sklearn pipeline, including feature imputation. Just beware of version conflicts of libraries between training and predicting.
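One hedged way to guard against such conflicts (the wrapping dict and the warning below are illustrative, not part of the original answer): store the library version alongside the model when dumping, and compare it when loading.

import warnings
import sklearn
from sklearn.externals import joblib

# at training time: bundle the sklearn version with the fitted object (illustrative)
joblib.dump({'model': clf, 'sklearn_version': sklearn.__version__}, '/tmp/classifier.pkl')

# at prediction time: warn if the runtime environment differs from the training one
bundle = joblib.load('/tmp/classifier.pkl')
if bundle['sklearn_version'] != sklearn.__version__:
    warnings.warn('classifier was trained with sklearn %s, running %s'
                  % (bundle['sklearn_version'], sklearn.__version__))
clf = bundle['model']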