Apply TensorFlow Transform to transform/scale features in production

Overview

I followed a guide to write TFRecords, using tf.Transform to preprocess my features. Now I would like to deploy my model, for which I need to apply this preprocessing function to real live data.

My Approach

First, suppose I have 2 features:

features = ['amount', 'age']

I have the transform_fn produced by the Apache Beam pipeline, stored in working_dir=gs://path-to-transform-fn/

Then I load the transform function using:

tf_transform_output = tft.TFTransformOutput(working_dir)

I thought that the easiest way to serve it in production was to get a numpy array of processed data and call model.predict() (I am using a Keras model).

To do this, I thought the transform_raw_features() method was exactly what I needed.

However, it seems that after building the schema:

raw_features = {}
for k in features:
    raw_features.update({k: tf.constant(1)})

print(tf_transform_output.transform_raw_features(raw_features))

I get:

AttributeError: 'Tensor' object has no attribute 'indices'

Now, I am assuming this happens because I used tf.VarLenFeature() when I defined the schema for my preprocessing_fn.

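def preprocessing_fn(inputs):
    # Start from a shallow copy so untouched features pass through unchanged.
    outputs = inputs.copy()

    # Standardize each numeric feature to zero mean and unit variance.
    for key in features:
        outputs[key] = tft.scale_to_z_score(outputs[key])

    return outputs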

And I build the metadata using:

RAW_DATA_FEATURE_SPEC = {}
for key in features:
    RAW_DATA_FEATURE_SPEC[key] = tf.VarLenFeature(dtype=tf.float32)

# Build the metadata once, after the feature spec is complete.
RAW_DATA_METADATA = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec(RAW_DATA_FEATURE_SPEC))

So in short, given a dictionary d = {'amount': [50], 'age': [32]}, I would like to apply this transform_fn and scale these values appropriately as input to my model for prediction. This dictionary is exactly the format of the elements in my PCollection before the data is processed by preprocessing_fn.
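For reference, the scaling being asked for can be sketched in plain Python. This is only an illustration of the z-score idea, (x - mean) / standard deviation, and not how tf.Transform applies it internally. The statistics in TRAIN_STATS and the helper name z_score are hypothetical; in the real pipeline the mean and variance are computed over the training data during the analyze phase and stored with the transform_fn.

```python
import math

# Hypothetical training-set statistics, made up for illustration only.
# In the real pipeline these come from the analyze phase of tf.Transform.
TRAIN_STATS = {
    "amount": {"mean": 40.0, "var": 100.0},
    "age": {"mean": 30.0, "var": 16.0},
}


def z_score(raw, stats):
    """Scale each feature value to (x - mean) / std using fixed statistics."""
    scaled = {}
    for name, values in raw.items():
        mean = stats[name]["mean"]
        std = math.sqrt(stats[name]["var"])
        # Guard against a constant feature (zero variance).
        scaled[name] = [(v - mean) / std if std > 0 else 0.0 for v in values]
    return scaled


d = {"amount": [50], "age": [32]}
scaled = z_score(d, TRAIN_STATS)
print(scaled)
```

The important point is that serving must reuse the statistics from training rather than recompute them on the incoming request, which is why the saved transform_fn is needed.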

Pipeline Structure:

class BeamProccess():

    def __init__(self):

        # init

        self.run()

    def run(self):

        def preprocessing_fn(inputs):

            # outputs = { 'id' : [list], 'amount': [list], 'age': [list] }
            return outputs

        with beam.Pipeline(options=self.pipe_opt) as p:
            with beam_impl.Context(temp_dir=self.google_cloud_options.temp_location):
                data = (p
                        | "read_table" >> beam.io.Read(table_bq)
                        | "create_data" >> beam.ParDo(ProcessFn()))

                transformed_dataset, transform_fn = (
                    (train, RAW_DATA_METADATA)
                    | beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))

                transformed_data, transformed_metadata = transformed_dataset

                transformed_data | "WriteTrainTFRecords" >> tfrecordio.WriteToTFRecord(
                    file_path_prefix=self.JOB_DIR + '/train/data',
                    file_name_suffix='.tfrecord',
                    coder=example_proto_coder.ExampleProtoCoder(transformed_metadata.schema))

                _ = (
                    transform_fn
                    | 'WriteTransformFn' >>
                    transform_fn_io.WriteTransformFn(path=self.JOB_DIR + '/transform/'))

And finally the ParDo() is:

class ProcessFn(beam.DoFn):

    def process(self, element):

        yield { 'id' : [list], 'amount': [list], 'age': [list] }

asked Jan 07 '19 20:01

user10430178

1 Answer

The problem is with the snippet

raw_features = {}
for k in features:
    raw_features.update({k: tf.constant(1)})

print(tf_transform_output.transform_raw_features(raw_features))

In this code you construct a dictionary where the values are dense tensors. As you said, this won't work for a VarLenFeature, which is represented as a sparse tensor (hence the missing indices attribute). Instead of tf.constant, try using tf.placeholder for a FixedLenFeature and tf.sparse_placeholder for a VarLenFeature.
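As a rough sketch of what a sparse input looks like, the dictionary from the question can be converted into (indices, values, dense_shape) triples in plain Python. These are the three fields of tf.SparseTensorValue, so each triple should be usable as the feed value for the matching tf.sparse_placeholder. The helper name example_to_sparse is hypothetical, and the sketch assumes a batch containing a single example.

```python
def example_to_sparse(example):
    """Convert {feature: [values]} for one example into sparse triples.

    Each triple is (indices, values, dense_shape) for a batch of size 1,
    matching the field order of tf.SparseTensorValue.
    """
    sparse = {}
    for name, values in example.items():
        # Row 0 is the single example; the column is the position in the list.
        indices = [[0, j] for j in range(len(values))]
        float_values = [float(v) for v in values]
        dense_shape = [1, len(values)]
        sparse[name] = (indices, float_values, dense_shape)
    return sparse


d = {'amount': [50], 'age': [32]}
sparse_inputs = example_to_sparse(d)
print(sparse_inputs['amount'])
```

The values are cast to float because the feature spec in the question declares tf.float32 for every feature.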

answered Sep 17 '22 19:09

Kester Tong

Tags:

python

tensorflow

apache-beam

tensorflow-serving

tensorflow-transform
