I am trying to design an input pipeline with the Dataset API. I am working with Parquet files. What is a good way to add them to my pipeline?
The tf.data API enables you to build complex input pipelines from simple, reusable pieces. For example, the pipeline for an image model might aggregate data from files in a distributed file system, apply random perturbations to each image, and merge randomly selected images into a batch for training.
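As a rough sketch of what such a pipeline can look like (the file pattern and the decode/augment functions are made up for illustration, and tf.data.AUTOTUNE assumes TF 2.x):

import tensorflow as tf

# Hypothetical file pattern; replace with your own data location.
files = tf.data.Dataset.list_files("/data/images/*.jpg")

def decode(path):
    # Read and decode one image file, then resize to a fixed shape.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [224, 224])

def augment(image):
    # Apply a random perturbation to each image.
    return tf.image.random_flip_left_right(image)

dataset = (files
           .map(decode, num_parallel_calls=tf.data.AUTOTUNE)
           .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(buffer_size=1000)
           .batch(32))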
Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training time and the time it takes to extract the data.
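Prefetching is a single transformation added at the end of the pipeline; a minimal sketch, assuming TF 2.x where tf.data.AUTOTUNE is available:

import tensorflow as tf

# A toy dataset standing in for any preprocessing pipeline.
dataset = tf.data.Dataset.range(10).map(lambda x: x * 2)

# Overlap preparation of step s+1 with training on step s;
# AUTOTUNE lets tf.data choose how many elements to buffer.
dataset = dataset.prefetch(tf.data.AUTOTUNE)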
from_tensor_slices creates a dataset with a separate element for each row of the input tensor:

>>> t = tf.constant([[1, 2], [3, 4]])
>>> ds = tf.data.Dataset.from_tensor_slices(t)
>>> [x for x in ds]
[<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>,
 <tf.Tensor: shape=(2,), dtype=int32, numpy=array([3, 4], dtype=int32)>]
We have released Petastorm, an open source library that allows you to use Apache Parquet files directly via the TensorFlow Dataset API.
Here is a small example:
import tensorflow as tf
from petastorm.reader import Reader
from petastorm.tf_utils import make_petastorm_dataset

with Reader('hdfs://.../some/hdfs/path') as reader:
    # Wrap the Petastorm reader in a tf.data.Dataset
    dataset = make_petastorm_dataset(reader)
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        sample = sess.run(tensor)
        print(sample.id)
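If you are on TensorFlow 2.x with eager execution, the same idea works without a Session; a rough sketch, assuming Petastorm's make_reader factory and a dataset that has an id field as in the example above:

from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset

with make_reader('hdfs://.../some/hdfs/path') as reader:
    dataset = make_petastorm_dataset(reader)
    for sample in dataset.take(1):  # iterate eagerly instead of sess.run
        print(sample.id)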