 

Tensorflow Dataset API with HDFS

We have stored a list of *.tfrecord files in an HDFS directory. I'd like to use the new Dataset API, but the only example given uses the old file queue and string_input_producer (https://www.tensorflow.org/deploy/hadoop). Those methods make it difficult to generate epochs, among other things.

Is there any way to use HDFS with the Dataset API?

Lukeyb asked Dec 24 '22 at 10:12

1 Answer

The HDFS file system layer works with both the old queue-based API and the new tf.data API. Assuming you have configured your system according to the TensorFlow/Hadoop deployment guide, you can create a dataset based on files in HDFS with the following code:

import tensorflow as tf

dataset = tf.data.TFRecordDataset(["hdfs://namenode:8020/path/to/file1.tfrecords",
                                   "hdfs://namenode:8020/path/to/file2.tfrecords"])
dataset = dataset.map(lambda record: tf.parse_single_example(record, ...))
# ...
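
Since the question mentions epochs: with tf.data you can handle repetition, shuffling, and batching directly on the dataset rather than through a file queue. A minimal sketch continuing from the snippet above (num_epochs, batch_size, and the shuffle buffer size are placeholder values, not from the original answer):

dataset = dataset.shuffle(buffer_size=10000)   # reshuffle records each epoch
dataset = dataset.repeat(num_epochs)           # iterate the data a fixed number of epochs
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()               # feed this tensor into the model/session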

Note that since HDFS is a distributed file system, you might benefit from some of the suggestions in the "Parallelize data extraction" section of the Input Pipeline performance guide.
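
As one illustration of that advice, here is a hedged sketch of parallel extraction across several HDFS files, assuming a TensorFlow version whose tf.data API includes list_files, interleave, num_parallel_calls, and prefetch (roughly 1.4 and later); the hostname, file pattern, parse_fn, and parallelism values are placeholders:

files = tf.data.Dataset.list_files("hdfs://namenode:8020/path/to/*.tfrecords")
# Read several TFRecord files concurrently instead of one at a time.
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
# Parse records on multiple threads and keep the next element ready ahead of time.
dataset = dataset.map(parse_fn, num_parallel_calls=4)
dataset = dataset.prefetch(1)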

mrry answered Jan 26 '23 at 05:01