We have stored a list of *.tfrecord files in an HDFS directory. I'd like to use the new Dataset API, but the only example given uses the old file queue and string_input_producer (https://www.tensorflow.org/deploy/hadoop). These methods make it difficult to generate epochs, amongst other things.
Is there any way to use HDFS with the Dataset API?
The HDFS file system layer works with both the old queue-based API and the new tf.data
API. Assuming you have configured your system according to the TensorFlow/Hadoop deployment guide, you can create a dataset based on files in HDFS with the following code:
dataset = tf.data.TFRecordDataset(["hdfs://namenode:8020/path/to/file1.tfrecords",
"hdfs://namenode:8020/path/to/file2.tfrecords"])
dataset = dataset.map(lambda record: tf.parse_single_example(record, ...))
# ...
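Since you mention generating epochs, note that the tf.data API handles epochs and batching with Dataset.repeat() and Dataset.batch() instead of queue runners. The following is only a minimal sketch of how the elided steps might be filled in: the namenode address, the feature spec, and the batch/epoch counts are placeholders you would replace with your own.

import tensorflow as tf

# Hypothetical HDFS paths; substitute your namenode address and file names.
filenames = ["hdfs://namenode:8020/path/to/file1.tfrecords",
             "hdfs://namenode:8020/path/to/file2.tfrecords"]

def parse(record):
    # Example feature spec; replace with the features actually stored in your TFRecords.
    features = {"label": tf.FixedLenFeature([], tf.int64)}
    return tf.parse_single_example(record, features)

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)        # placeholder batch size
dataset = dataset.repeat(10)       # 10 epochs; omit the argument to repeat indefinitely

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next() # fetch this tensor in your training loop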
Note that since HDFS is a distributed file system, you might benefit from some of the suggestions in the "Parallelize data extraction" section of the Input Pipeline performance guide.
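For example, one way to parallelize extraction is to interleave reads across several HDFS files and parse records on multiple threads. This is just a sketch under assumptions: the glob pattern, cycle_length, and num_parallel_calls values are made up and would need tuning for your cluster.

# Discover TFRecord files on HDFS (hypothetical path/pattern).
files = tf.data.Dataset.list_files("hdfs://namenode:8020/path/to/*.tfrecords")

# Read several HDFS files concurrently instead of one at a time.
dataset = files.interleave(
    lambda filename: tf.data.TFRecordDataset(filename),
    cycle_length=4)

# Parse records on multiple threads and overlap preprocessing with training.
dataset = dataset.map(parse, num_parallel_calls=4)
dataset = dataset.prefetch(1)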