After creating a tf.data.Dataset, I would like to write it to TFRecords.
One way to do that is to iterate through the complete dataset, serialize each element with SerializeToString, and write it to a TFRecord file. But that is not the most efficient way to do it.
Are there easier ways to do this? Are there any APIs available in TF2.0?
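For reference, the manual approach described above might look roughly like the sketch below. The feature names "x" and "label" and the helpers serialize_row and write_dataset are made up for illustration, and the dataset is assumed to yield (x, label) pairs with a float vector and an integer label.

```python
import tensorflow as tf


def serialize_row(x, label):
    """Packs one (x, label) pair into a serialized tf.train.Example."""
    feature = {
        # Hypothetical feature names; adapt them to your own schema.
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=x)),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_dataset(dataset, path):
    """Iterates eagerly over the dataset and writes one record per element."""
    with tf.io.TFRecordWriter(path) as writer:
        for x, label in dataset:
            # Convert eager tensors to plain Python values before serializing.
            writer.write(serialize_row(x.numpy().tolist(), int(label.numpy())))
```

This works, but every element passes through a Python loop, which is the inefficiency the question is about.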
You could use TensorFlow Datasets (tfds): this library is not only a collection of ready-to-use tf.data.Dataset objects, but also a toolchain for transforming raw data into TFRecords.
Adding a new dataset is straightforward if you follow the official guide. In short, you mainly have to implement the methods _info and _generate_examples (together with _split_generators, which declares the splits and passes arguments to _generate_examples).
In particular, _generate_examples is the method that tfds uses to create rows inside the TFRecords.
Every element that _generate_examples yields is a unique key paired with a dictionary of features; every dictionary becomes a row in a TFRecord file.
For example (taken from the official documentation), the _generate_examples method below is used by tfds to save TFRecords, each record with the features "image_description", "image", and "label".
def _generate_examples(self, images_dir_path, labels):
    # Read the input data out of the source files
    for image_file in tf.io.gfile.listdir(images_dir_path):
        ...
    with tf.io.gfile.GFile(labels) as f:
        ...
    # And yield examples as feature dictionaries
    for image_id, description, label in data:
        yield image_id, {
            "image_description": description,
            "image": "%s/%s.jpeg" % (images_dir_path, image_id),
            "label": label,
        }
In your case, you can just take the tf.data.Dataset object you already have, loop through it in the _generate_examples method, and yield the rows of the TFRecord.
This way, tfds takes care of the serialization for you, and you'll find the TFRecord files created for your dataset in the ~/tensorflow_datasets folder.