TFDV generates the schema as a Schema protocol buffer. However, there seems to be no helper function to write the schema to a file or read it back.
schema = tfdv.infer_schema(stats)
How can I save and load it?
TensorFlow Data Validation identifies any anomalies in the input data by comparing data statistics against a schema. The schema codifies properties which the input data is expected to satisfy, such as data types or categorical values, and can be modified or replaced by the user.
In short, the schema describes the expectations for "correct" data and can thus be used to detect errors in the data. Moreover, the same schema can be used to set up TensorFlow Transform for data transformations.
You can use the following methods to write/load the schema to/from a file.
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

def write_schema(schema, output_path):
    # Serialize the Schema proto to protobuf text format and write it out
    schema_text = text_format.MessageToString(schema)
    file_io.write_string_to_file(output_path, schema_text)

def load_schema(input_path):
    # Parse protobuf text format back into an empty Schema proto
    schema = schema_pb2.Schema()
    schema_text = file_io.read_file_to_string(input_path)
    text_format.Parse(schema_text, schema)
    return schema
If you will be using it with TensorFlow Transform, then I would suggest the following functions:
import os
import tensorflow_data_validation as tfdv
from tensorflow.python.lib.io import file_io
from tensorflow_transform.tf_metadata import metadata_io
# Define file path (OUTPUT_DIR is a directory of your choice)
file_io.recursive_create_dir(OUTPUT_DIR)
schema_file = os.path.join(OUTPUT_DIR, 'schema.pbtxt')
# Write schema
tfdv.write_schema_text(schema, schema_file)
# Read schema with tfdv
schema = tfdv.load_schema_text(schema_file)
# Read schema with tensorflow_transform; this returns a DatasetMetadata
# object that wraps the schema rather than a bare Schema proto
metadata = metadata_io.read_metadata(OUTPUT_DIR)
The output is in protobuf text format, which is human-readable and similar to JSON. If you prefer to save it as plain JSON, you can use the following:
from google.protobuf import json_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

def write_schema(schema, output_path):
    # Serialize the Schema proto to a JSON string and write it out
    schema_text = json_format.MessageToJson(schema)
    file_io.write_string_to_file(output_path, schema_text)

def load_schema(input_path):
    # Parse the JSON string back into a Schema proto
    schema_text = file_io.read_file_to_string(input_path)
    schema = json_format.Parse(schema_text, schema_pb2.Schema())
    return schema
Or, if you don't need a human-readable format, you can use the message's SerializeToString() and ParseFromString(data) methods for binary serialization and deserialization.