We have stored a list of *.tfrecord files in an HDFS directory. I'd like to use the new Dataset API, but the only example given uses the old file queue and string_input_producer (https://www.tensorflow.org/deploy/hadoop). These methods make it difficult to generate epochs, amongst other things.
Is there any way to use HDFS with the Dataset API?
The HDFS file system layer works with both the old queue-based API and the new tf.data
API. Assuming you have configured your system according to the TensorFlow/Hadoop deployment guide, you can create a dataset based on files in HDFS with the following code:
dataset = tf.data.TFRecordDataset(["hdfs://namenode:8020/path/to/file1.tfrecords",
"hdfs://namenode:8020/path/to/file2.tfrecords"])
dataset = dataset.map(lambda record: tf.parse_single_example(record, ...))
# ...
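Since you mention generating epochs, note that the tf.data API handles epochs and batching with Dataset.repeat() and Dataset.batch() instead of queue runners. The following is only a minimal sketch of how the elided steps might be filled in: the namenode address, the feature spec, and the batch/epoch counts are placeholders you would replace with your own.

import tensorflow as tf

# Hypothetical HDFS paths; substitute your namenode address and file names.
filenames = ["hdfs://namenode:8020/path/to/file1.tfrecords",
             "hdfs://namenode:8020/path/to/file2.tfrecords"]

def parse(record):
    # Example feature spec; replace with the features actually stored in your TFRecords.
    features = {"label": tf.FixedLenFeature([], tf.int64)}
    return tf.parse_single_example(record, features)

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)        # placeholder batch size
dataset = dataset.repeat(10)       # 10 epochs; omit the argument to repeat indefinitely

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next() # fetch this tensor in your training loop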
Note that since HDFS is a distributed file system, you might benefit from some of the suggestions in the "Parallelize data extraction" section of the Input Pipeline performance guide.
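For example, one way to parallelize extraction is to interleave reads across several HDFS files and parse records on multiple threads. This is just a sketch under assumptions: the glob pattern, cycle_length, and num_parallel_calls values are made up and would need tuning for your cluster.

# Discover TFRecord files on HDFS (hypothetical path/pattern).
files = tf.data.Dataset.list_files("hdfs://namenode:8020/path/to/*.tfrecords")

# Read several HDFS files concurrently instead of one at a time.
dataset = files.interleave(
    lambda filename: tf.data.TFRecordDataset(filename),
    cycle_length=4)

# Parse records on multiple threads and overlap preprocessing with training.
dataset = dataset.map(parse, num_parallel_calls=4)
dataset = dataset.prefetch(1)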