
Google Cloud ML-engine scikit-learn prediction probability 'predict_proba()'

Google Cloud ML-engine supports deploying scikit-learn Pipeline objects. For example, a text classification Pipeline could look like this:

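from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', naive_bayes.MultinomialNB())])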

The classifier can then be trained:

classifier.fit(train_x, train_y)

The trained classifier can then be uploaded to Google Cloud Storage:

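import datetime, subprocess, sys
import joblib

model = 'model.joblib'
joblib.dump(classifier, model)
# Build the gs:// URI with '/' explicitly; os.path.join would use '\' on Windows.
model_remote_path = 'gs://{}/{}/{}'.format(bucket_name, datetime.datetime.now().strftime('model_%Y%m%d_%H%M%S'), model)
subprocess.check_call(['gsutil', 'cp', model, model_remote_path], stderr=sys.stdout)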

A Model and a Version can then be created, either through the Google Cloud Console or programmatically, with the Version pointing at the Cloud Storage location of the 'model.joblib' file.
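
As a rough sketch of the programmatic route, the request body for a projects.models.versions.create call could be assembled as below. As far as I know, deploymentUri is expected to be the Cloud Storage directory containing model.joblib rather than the file itself. The helper name build_version_body, the bucket path, and the default runtime and Python versions are assumptions made for illustration, so check them against the runtimes currently on offer.

```python
def build_version_body(version_name, model_remote_path,
                       runtime_version='1.15', python_version='3.7'):
    """Build the body for a projects.models.versions.create request.

    runtime_version and python_version are placeholder defaults (assumptions).
    """
    # deploymentUri is the directory holding model.joblib, so drop the file name.
    deployment_uri = model_remote_path.rsplit('/', 1)[0]
    return {
        'name': version_name,
        'deploymentUri': deployment_uri,
        'runtimeVersion': runtime_version,
        'framework': 'SCIKIT_LEARN',
        'pythonVersion': python_version,
    }


# Hypothetical bucket and path, shaped like model_remote_path above.
body = build_version_body('v1', 'gs://my-bucket/model_20180903_140900/model.joblib')
print(body['deploymentUri'])

# With an authenticated client (not executed here):
# ml = discovery.build('ml', 'v1')
# ml.projects().models().create(
#     parent='projects/{}'.format(project_name), body={'name': model_name}).execute()
# ml.projects().models().versions().create(
#     parent='projects/{}/models/{}'.format(project_name, model_name), body=body).execute()
```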

The classifier can then be used to make predictions on new data by calling the deployed model's predict endpoint:

from googleapiclient import discovery

ml = discovery.build('ml', 'v1')
# Full resource name of the model (or of a specific version) to query.
name = 'projects/{}/models/{}'.format(project_name, model_name)
if version_name is not None:
    name += '/versions/{}'.format(version_name)
request_dict = {'instances': ['Test data']}
response = ml.projects().predict(name=name, body=request_dict).execute()
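
For reference, the dict returned by execute() carries the results under a 'predictions' key, or an 'error' key if the request failed. A minimal sketch of unpacking it follows. The helper name extract_predictions and the sample response are made up for illustration.

```python
def extract_predictions(response):
    """Return the list of predictions from a predict() response dict."""
    # The online prediction service reports failures in an 'error' field.
    if 'error' in response:
        raise RuntimeError(response['error'])
    return response['predictions']


# Hypothetical response, shaped like the dict returned by .execute().
ok_response = {'predictions': ['sports']}
print(extract_predictions(ok_response))
```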

Google Cloud ML-engine calls the classifier's predict function and returns the predicted class. However, I would also like to return the confidence score. Normally this would be done by calling the classifier's predict_proba function, but there doesn't seem to be an option to change which function is called. My question is: is it possible to return the confidence score for a scikit-learn classifier when using Google Cloud ML-engine? If not, do you have any recommendations for how else to achieve this?

Update: I've found a hacky solution. It involves overwriting the classifier's predict function with its own predict_proba function:

nb = naive_bayes.MultinomialNB()
# The instance attribute shadows the class's predict method.
nb.predict = nb.predict_proba
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', nb)])

Surprisingly, this works. If anyone knows of a neater solution, please let me know.
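
The likely reason it works is that Pipeline.predict transforms the input and then delegates to the predict attribute of the final step, which on this instance is now predict_proba. Here is a small local check of that behaviour. The training sentences and labels are toy data invented for the example.

```python
from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy training data, invented for this example.
train_x = ['good great film', 'great acting', 'bad boring film', 'boring plot']
train_y = ['pos', 'pos', 'neg', 'neg']

nb = naive_bayes.MultinomialNB()
nb.predict = nb.predict_proba  # the instance attribute shadows the class method
classifier = Pipeline([('vect', CountVectorizer()), ('clf', nb)])
classifier.fit(train_x, train_y)

# Pipeline.predict calls the last step's predict, which is now predict_proba,
# so each output row holds one probability per class.
probabilities = classifier.predict(['great film'])
print(classifier.classes_)  # column order of the probabilities
print(probabilities)
```

Each row follows the order of classifier.classes_, so the caller needs that order to map scores back to labels.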

Update: Google have released a new feature (currently in beta) called Custom prediction routines. This lets you define what code is run when a prediction request comes in. It adds more code to the solution, but it is certainly less hacky.
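
As I understand the documented interface, a custom prediction routine is a class exposing a predict(instances, **kwargs) method and a from_path(model_dir) class method that loads the model artifacts. Below is a minimal sketch under that assumption. The class name ProbabilityPredictor and the _StubModel stand-in are hypothetical, and the stub exists only so the class can be exercised locally without a trained model.

```python
import os


class ProbabilityPredictor(object):
    """Sketch of a custom prediction routine that returns class probabilities."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # Convert to plain lists of floats so the result is JSON serializable.
        probabilities = self._model.predict_proba(instances)
        return [[float(p) for p in row] for row in probabilities]

    @classmethod
    def from_path(cls, model_dir):
        # Imported here so the class can be exercised without joblib installed.
        import joblib
        model = joblib.load(os.path.join(model_dir, 'model.joblib'))
        return cls(model)


class _StubModel(object):
    """Stand-in for a fitted classifier, used only for this local check."""

    def predict_proba(self, instances):
        return [[0.25, 0.75] for _ in instances]


predictor = ProbabilityPredictor(_StubModel())
print(predictor.predict(['Test data']))
```

With this approach the pipeline itself stays untouched, and the choice of predict_proba lives in the serving code instead of in a patched estimator.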

Alex Morgan asked Sep 03 '18 14:09




1 Answer

The ML Engine API you are using only has the predict method, as you can see in the documentation, so it will only return the prediction (unless you force it to do something else with the hack you mentioned).

If you want to do something else with your trained model, you’ll have to load it and use it normally. If you want to use the model stored in Cloud Storage you can do something like:

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from google.cloud import storage

bucket_name = "<BUCKET_NAME>"
gs_model = "path/to/model.joblib"  # path in your Cloud Storage bucket
local_model = "/path/to/model.joblib"  # path on your local machine

# Download the serialized pipeline from Cloud Storage.
client = storage.Client()
bucket = client.get_bucket(bucket_name)
blob = bucket.blob(gs_model)
blob.download_to_filename(local_model)

# Load it and call predict_proba directly; test_data is a list of raw text strings.
model = joblib.load(local_model)
model.predict_proba(test_data)
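
predict_proba returns one row per input, with columns in the order of model.classes_. To turn that into a predicted label plus confidence score, something like the following could be used. The helper name label_confidences and the sample values are hypothetical stand-ins for model.classes_ and the predict_proba output.

```python
def label_confidences(classes, probabilities):
    """Pair each probability row with its most likely class and that class's score."""
    results = []
    for row in probabilities:
        row = list(row)
        # Index of the largest probability in this row.
        best = max(range(len(row)), key=row.__getitem__)
        results.append((classes[best], row[best]))
    return results


# Hypothetical values standing in for model.classes_ and model.predict_proba(test_data).
classes = ['neg', 'pos']
probabilities = [[0.25, 0.75], [0.9, 0.1]]
print(label_confidences(classes, probabilities))
```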

rilla answered Oct 15 '22 09:10