Split .tfrecords file into many .tfrecords files

Is there any way to split a .tfrecords file into many .tfrecords files directly, without writing back each Dataset example?

1 Answer

In TensorFlow 2.0.0, this works (later releases deprecate tf.data.experimental.TFRecordWriter in favor of tf.io.TFRecordWriter):

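import tensorflow as tf

# Read the records as raw serialized bytes; nothing is parsed.
raw_dataset = tf.data.TFRecordDataset("input_file.tfrecord")

shards = 10

for i in range(shards):
    # shard(shards, i) keeps every record whose index modulo `shards` equals i,
    # so the input file is read once per output shard.
    writer = tf.data.experimental.TFRecordWriter(f"output_file-part-{i}.tfrecord")
    writer.write(raw_dataset.shard(shards, i))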
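
If you want to avoid going through tf.data at all, an uncompressed TFRecord file can also be split by copying its framed records byte for byte. Each record is stored as an 8-byte little-endian length, a 4-byte CRC of that length, the payload, and a 4-byte CRC of the payload. The sketch below is only an illustration under those assumptions: split_tfrecord and the "{}" naming pattern are hypothetical names of mine, the CRCs are copied as-is rather than verified, and compressed files are not handled. It assigns record k to shard k % num_shards, which matches what shard(shards, i) does above, but reads the input only once.

```python
import struct


def split_tfrecord(src, out_pattern, num_shards):
    """Copy framed records from src round-robin into num_shards files.

    out_pattern is a hypothetical naming scheme with one {} placeholder
    for the shard index. Returns the number of records written per shard.
    """
    counts = [0] * num_shards
    outs = [open(out_pattern.format(i), "wb") for i in range(num_shards)]
    try:
        with open(src, "rb") as f:
            index = 0
            while True:
                # 8-byte little-endian length + 4-byte CRC of the length
                header = f.read(12)
                if not header:
                    break  # clean end of file
                if len(header) < 12:
                    raise ValueError("truncated record header")
                (length,) = struct.unpack("<Q", header[:8])
                # payload + 4-byte CRC of the payload (copied, not verified)
                body = f.read(length + 4)
                if len(body) < length + 4:
                    raise ValueError("truncated record body")
                shard = index % num_shards
                outs[shard].write(header + body)
                counts[shard] += 1
                index += 1
    finally:
        for out in outs:
            out.close()
    return counts
```

For example, split_tfrecord("input_file.tfrecord", "output_file-part-{}.tfrecord", 10) would produce the same ten file names as the answer above. Since the records are never decoded, a corrupted input is copied through unchanged, so it is worth reading one of the shards back with tf.data.TFRecordDataset before relying on the output.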
