Converting google-cloud-ml github Reddit example from regression to classification and adding keys?

I've been trying to adapt the reddit_tft example from the cloud-ml github samples repo to my needs.

I've been able to get it running as per the tutorial readme.

However what i want to use it for is a binary classification problem and also output keys in batch prediction.

So i have made copy of the tutorial code here and have changed it in a few places to be able to have a model type of deep_classifier that would use a DNNClasifier instead of a DNNRegressor.

I've changed the score variable to be

if(score>0,1,0) as score

It's training fine, deploys to cloud ml but i'm not sure how to now get keys back from my predictions. `

I've updated the sql pulling from BigQuery to include id as example_id here

It seems the code from the tutorial had some sort of placeholder for example_id so i'm trying to leverage that.

It all seems to work but when i get batch predictions all i get is json like this:

{"classes": ["0", "1"], "scores": [0.20427155494689941, 0.7957285046577454]} {"classes": ["0", "1"], "scores": [0.14911963045597076, 0.8508803248405457]} ...

So example_id does not seem to be making it into the serving functions like i need.

I've tried to follow the approach here which is based on adapting the census example for keys.

I just cant figure out how to finish adapting this reddit example to also output keys in the predictions as they look a bit different to me in terms of design and functions being used.

Update 1

My latest attempt is here Trying to use the approach outlined here.

However this is giving errors:

NotFoundError (see above for traceback): /tmp/tmp2jllvb/model.ckpt-1_temp_9530d2c5823d4462be53fa5415e429fd; No such file or directory
     [[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:ps/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, dnn/hiddenlayer_0/kernel/part_2/read, dnn/dnn/hiddenlayer_0/kernel/part_2/Adagrad/read, dnn/hiddenlayer_1/kernel/part_2/read, dnn/dnn/hiddenlayer_1/kernel/part_2/Adagrad/read, dnn/input_from_feature_columns/input_layer/subreddit_id_embedding/weights/part_0/read, dnn/dnn/input_from_feature_columns/input_layer/subreddit_id_embedding/weights/part_0/Adagrad/read, dnn/logits/bias/part_0/read, dnn/dnn/logits/bias/part_0/Adagrad/read, global_step)]]

Update 2

My latest attempt and details are here.

I'm now getting a error from tensorflow-fransform (run_preprocess.sh works fine in tft 0.1)

File "/usr/local/lib/python2.7/dist-packages/tensorflow_transform/tf_metadata/dataset_schema.py", line 282, in __setstate__ self._dtype = tf.as_dtype(state['dtype']) TypeError: string indices must be integers, not str

Update 3

I have changed things to just use beam + csv and avoid tft. Also i'm now using the approach as outlined here for extending the canned estimator to get the key back with the predictions.

However when following this post to try get the comments in as features i'm now running into a new error.

The replica worker 3 exited with a non-zero status of 1. Termination reason: Error. Traceback (most recent call last): [...] File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/estimator/python/estimator/extenders.py", line 87, in new_model_fn spec = estimator.model_fn(features, labels, mode, config) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 203, in public_model_fn return self._call_model_fn(features, labels, mode, config) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 694, in _call_model_fn model_fn_results = self._model_fn(features=features, **kwargs) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/dnn_linear_combined.py", line 520, in _model_fn config=config) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/dnn_linear_combined.py", line 158, in _dnn_linear_combined_model_fn dnn_logits = dnn_logit_fn(features=features, mode=mode) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/canned/dnn.py", line 89, in dnn_logit_fn features=features, feature_columns=feature_columns) File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/feature_column/feature_column.py", line 226, in input_layer with variable_scope.variable_scope(None, default_name=column.name): File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1826, in __enter__ current_name_scope_name = self._current_name_scope.__enter__() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 4932, in __enter__ return self._name_scope.__enter__() File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ return self.gen.next() File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3514, in name_scope raise ValueError("'%s' is not a valid scope name" % name) ValueError: 'Tensor("Slice:0", shape=(?, 20), dtype=int64)_embedding' is not a valid scope name

My repo for this attempt/approach is here. This all runs fine if i just use subreddit as a feature, it's adding in the comment feature that seems to be causing the problems. Lines 103 to 111 is where i have followed this approach.

Not sure what's triggering the error in my code from reading the trace. Anyone any ideas?

Or can anyone point me towards another approach to go from text to bow to embedding feature in TF?

How to use cloud natural language API on GitHub?

You can find the sample text files in the resources/texts folder of the GitHub project. To use the Cloud Natural Language API, you must to import the language module from the google-cloud-language library. The language.types module contains classes that are required for creating requests.

How do you convert a regression to a classification?

To add to the number of methods you can use to convert your regression problem into a classification problem, you can use discretised percentiles to define categories instead of numerical values. For example, from this you can then predict if the price is in the top 10th (20th, 30th, etc.) percentile.

How do I use BigQuery mL to perform logistic regression?

The model schema lists the attributes that BigQuery ML used to perform logistic regression. The schema should look like the following: After creating your model, evaluate the performance of the model using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values against the actual data.

How do you use binary logistic regression to classify data?

You can use the binary logistic regression model type to predict whether a value falls into one of two categories; or, you can use the multi-class regression model type to predict whether a value falls into one of multiple categories. These are known as classification problems, because they attempt to classify data into two or more categories.

See:

https://medium.com/@lakshmanok/how-to-extend-a-canned-tensorflow-estimator-to-add-more-evaluation-metrics-and-to-pass-through-ddf66cd3047d

Here's what the code looks like to pass through keys:

def forward_key_to_export(estimator):
    estimator = tf.contrib.estimator.forward_features(estimator, KEY_COLUMN)

    ## This shouldn't be necessary (I've filed CL/187793590 to update extenders.py with this code)
    config = estimator.config
    def model_fn2(features, labels, mode):
      estimatorSpec = estimator._call_model_fn(features, labels, mode, config=config)
      if estimatorSpec.export_outputs:
        for ekey in ['predict', 'serving_default']:
          estimatorSpec.export_outputs[ekey] = \
            tf.estimator.export.PredictOutput(estimatorSpec.predictions)
      return estimatorSpec
    return tf.estimator.Estimator(model_fn=model_fn2, config=config)
    ##

# Create estimator to train and evaluate
def train_and_evaluate(output_dir):
    estimator = tf.estimator.DNNLinearCombinedRegressor(...)
    estimator = forward_key_to_export(estimator)
    ...
    tf.estimator.train_and_evaluate(estimator, ...)

Converting google-cloud-ml github Reddit example from regression to classification and adding keys?