Is there any way to split a .tfrecords file into many .tfrecords files directly, without writing back each Dataset example?
The TFRecord format is a simple format for storing a sequence of binary records. Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data. Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.
The tf.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.
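To make "a sequence of binary records" concrete, here is a minimal sketch of walking the records of an uncompressed TFRecord file in plain Python. It assumes the documented on-disk layout: an 8-byte little-endian length, a 4-byte masked CRC of that length, the payload, and a 4-byte masked CRC of the payload. The function name iter_record_payloads is hypothetical, and the sketch skips CRC verification.

```python
import struct


def iter_record_payloads(data: bytes):
    """Yield the payload of each record in an uncompressed TFRecord byte string.

    Assumed layout per record: uint64 length (little-endian), uint32 CRC of
    the length, `length` payload bytes, uint32 CRC of the payload.
    The CRCs are skipped here, not verified.
    """
    offset = 0
    while offset < len(data):
        if offset + 12 > len(data):
            raise ValueError("truncated record header")
        (length,) = struct.unpack_from("<Q", data, offset)
        start = offset + 12          # skip the length and its CRC
        end = start + length
        if end + 4 > len(data):      # payload plus trailing CRC must fit
            raise ValueError("truncated record body")
        yield data[start:end]
        offset = end + 4             # skip the payload CRC
```

Each yielded payload is the serialized message, typically a tf.Example, still undecoded. Reading a file with open(path, "rb").read() and passing the bytes in is enough for small files; the whole file is held in memory, so this is only meant for inspection.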
In TensorFlow 2.0.0, this will work:
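import tensorflow as tf

# Read the serialized records as raw bytes; nothing is parsed into tf.Example.
raw_dataset = tf.data.TFRecordDataset("input_file.tfrecord")

shards = 10
for i in range(shards):
    # shard(shards, i) keeps every record whose index modulo `shards` equals i.
    writer = tf.data.experimental.TFRecordWriter(f"output_file-part-{i}.tfrecord")
    writer.write(raw_dataset.shard(shards, i))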
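If the goal is to avoid going through a Dataset at all, the records can also be copied byte for byte. The following is a rough sketch, not a TensorFlow API: it assumes an uncompressed input file and the standard record framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC). The name split_tfrecord and its output_pattern parameter are hypothetical. Records are assigned round-robin, which gives the same grouping as shard(shards, i) above, and the existing CRCs are copied unchanged rather than checked.

```python
import struct
from contextlib import ExitStack


def split_tfrecord(input_path, output_pattern, num_shards):
    """Copy record k of input_path, bytes unchanged, into shard k % num_shards.

    output_pattern is a str.format pattern such as "output_file-part-{}.tfrecord".
    Returns the number of records written to each shard.
    """
    counts = [0] * num_shards
    with ExitStack() as stack:
        src = stack.enter_context(open(input_path, "rb"))
        outputs = [
            stack.enter_context(open(output_pattern.format(i), "wb"))
            for i in range(num_shards)
        ]
        index = 0
        while True:
            header = src.read(12)  # 8-byte length + 4-byte CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            body = src.read(length + 4)  # payload + 4-byte CRC of the payload
            if len(body) < length + 4:
                raise ValueError("truncated record body")
            shard = index % num_shards
            outputs[shard].write(header + body)
            counts[shard] += 1
            index += 1
    return counts
```

For example, split_tfrecord("input_file.tfrecord", "output_file-part-{}.tfrecord", 10) would read the input once and write ten files, whereas the loop above reads the input once per shard. A GZIP- or ZLIB-compressed TFRecord file would need to be decompressed first.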