
Which is better when reading from remote hosts like HDFS: TFRecordDataset + num_parallel_reads, or parallel_interleave?

The goal is to efficiently read data from a remote host (e.g. HDFS). With the TensorFlow Dataset API, I can either follow the guide here and use parallel_interleave to read from different files on the remote host, like so

def input_fn():
  files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
  dataset = files.apply(
      tf.data.experimental.parallel_interleave(
          lambda filename: tf.data.TFRecordDataset(filename),
          cycle_length=4))
  dataset = dataset.map(map_func=parse_fn)
  dataset = dataset.batch(batch_size=FLAGS.batch_size)
  dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  return dataset

Or I can use num_parallel_reads, link, to read from different files on the remote host, like so

def input_fn():
  files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
  dataset = tf.data.TFRecordDataset(files, num_parallel_reads=4)
  dataset = dataset.map(map_func=parse_fn)
  dataset = dataset.batch(batch_size=FLAGS.batch_size)
  dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  return dataset

I assume they both serve the same purpose: 4 of my CPU's threads fetch data from 4 different files, giving better throughput than reading a single file. Is there any difference between the two approaches in this case?

I also assume the first method reads from different files within each batch, more like a breadth-first traversal of my remote files, while the second approach is more like a depth-first traversal. On a local filesystem with low latency it may not matter, but for a remote host like HDFS, which way is preferred?
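For intuition, here is a toy sketch of the ordering difference I have in mind (using small in-memory datasets as stand-ins for the remote files, not my actual HDFS pipeline):

import tensorflow as tf

def toy_file(start):
  # Each "file" is a tiny dataset of 4 consecutive records.
  return tf.data.Dataset.from_tensor_slices(tf.range(start, start + 4))

sources = tf.data.Dataset.from_tensor_slices([0, 100])

# Interleaved ("breadth-first") order: 0, 1, 100, 101, 2, 3, 102, 103
interleaved = sources.interleave(toy_file, cycle_length=2, block_length=2)

# One-file-at-a-time ("depth-first") order: 0, 1, 2, 3, 100, 101, 102, 103
sequential = sources.flat_map(toy_file)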

Kevin Yen asked Oct 27 '25


1 Answer

I just went through the source code of both TFRecordDataset and parallel_interleave. Note that I am looking at tf.data.experimental, as the tf.contrib.data one is deprecated. Funnily enough, they both call on the same class, ParallelInterleaveDataset, to do the parallel reading. So it comes down to which one lets you optimize your pipeline better: with parallel_interleave you get extra parameters such as block_length, sloppy, buffer_output_elements and prefetch_input_elements, which can potentially speed up your pipeline while also introducing some randomness in the output ordering.
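For reference, a sketch of what tuning those knobs might look like, based on the question's first snippet (the parameter values are illustrative only, not recommendations):

def input_fn():
  files = tf.data.Dataset.list_files("hdfs:///path/to/dataset/train-*.tfrecord")
  dataset = files.apply(
      tf.data.experimental.parallel_interleave(
          lambda filename: tf.data.TFRecordDataset(filename),
          cycle_length=4,              # number of files read concurrently
          block_length=16,             # consecutive records pulled from each file per turn
          sloppy=True,                 # allow out-of-order output for extra throughput
          buffer_output_elements=16,   # records buffered per open file
          prefetch_input_elements=4))  # input filenames prepared ahead of the readers
  dataset = dataset.map(map_func=parse_fn)
  dataset = dataset.batch(batch_size=FLAGS.batch_size)
  dataset = dataset.prefetch(buffer_size=FLAGS.prefetch_buffer_size)
  return dataset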

kvish answered Oct 30 '25