
TFDV Tensorflow Data Validation: how can I save/load the protobuf schema to/from a file

TFDV generates the schema as a Schema protocol buffer. However, there seems to be no helper function to write the schema to a file or read it back.

schema = tfdv.infer_schema(stats)

How can I save it to a file and load it again?

Vincent Teyssier asked Oct 12 '18 12:10




2 Answers

You can use the following methods to write/load the schema to/from a file.

from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

def write_schema(schema, output_path):
  schema_text = text_format.MessageToString(schema)
  file_io.write_string_to_file(output_path, schema_text)

def load_schema(input_path):
  schema = schema_pb2.Schema()
  schema_text = file_io.read_file_to_string(input_path)
  text_format.Parse(schema_text, schema)
  return schema
Paul Suganthan answered Oct 31 '22 14:10


If you will be using the schema with TensorFlow Transform, I would suggest the following functions:

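import os

import tensorflow_data_validation as tfdv
from tensorflow.python.lib.io import file_io
from tensorflow_transform.tf_metadata import metadata_io

# Define file path (OUTPUT_DIR is assumed to be defined by the caller)
file_io.recursive_create_dir(OUTPUT_DIR)
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')

# Write schema
tfdv.write_schema_text(schema, schema_file)

# Read schema with tfdv
schema = tfdv.load_schema_text(schema_file)

# Read schema with tensorflow_transform.
# read_metadata takes the directory rather than the file and returns a
# DatasetMetadata object that wraps the schema; the exact file layout it
# expects may depend on the tensorflow_transform version.
metadata = metadata_io.read_metadata(OUTPUT_DIR)
schema = metadata.schema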

The output is human-readable text, similar to JSON. If you prefer to save it as plain JSON, you can use the following:

from google.protobuf import json_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

def write_schema(schema, output_path):
    schema_text = json_format.MessageToJson(schema)
    file_io.write_string_to_file(output_path, schema_text)

def load_schema(input_path):
    schema_text = file_io.read_file_to_string(input_path)
    schema = json_format.Parse(schema_text, schema_pb2.Schema())
    return schema

Or, if you don't need a human-readable format, you can use the protobuf message methods SerializeToString() and ParseFromString(data) to serialize and deserialize the schema in binary form.
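A minimal sketch of that binary route, for illustration only. The helper names write_schema_binary and load_schema_binary are made up for this example and are not part of TFDV. The sketch uses Python's built-in open(), so it assumes a local file path; for remote filesystems you would likely go through file_io as in the functions above.

```python
def write_schema_binary(schema, output_path):
    """Write a protobuf message to output_path in binary wire format."""
    with open(output_path, "wb") as f:
        f.write(schema.SerializeToString())


def load_schema_binary(input_path, schema=None):
    """Read a binary-serialized protobuf message from input_path.

    If no target message is passed in, an empty Schema is created.
    """
    if schema is None:
        # Imported lazily so the helpers also work with other message types.
        from tensorflow_metadata.proto.v0 import schema_pb2
        schema = schema_pb2.Schema()
    with open(input_path, "rb") as f:
        schema.ParseFromString(f.read())
    return schema
```

With a schema from tfdv.infer_schema(stats), usage would look like write_schema_binary(schema, 'schema.pb') followed by schema = load_schema_binary('schema.pb'). The binary file is compact but not human-readable, so the text or JSON variants are easier to review and diff.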

Tim Smole answered Oct 31 '22 15:10