The current TensorFlow dataset interleave functionality is basically an interleaved flat map that takes a single dataset as input. Given the current API, what's the best way to interleave multiple datasets together? Say they have already been constructed and I have a list of them. I want to produce elements from them alternately, and I want to support lists with more than two datasets (i.e., stacked zips and interleaves would be pretty ugly).
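For reference, here is roughly how the existing single-dataset interleave is used (a minimal sketch; the filenames are just placeholders):

import tensorflow as tf

# interleave() maps each element to an inner dataset and flattens the results,
# cycling through `cycle_length` inner datasets at a time.
filenames = tf.data.Dataset.from_tensor_slices(["a.txt", "b.txt"])
lines = filenames.interleave(
    lambda f: tf.data.TextLineDataset(f),
    cycle_length=2, block_length=1)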
Thanks! :)
@mrry might be able to help.
To iterate over the dataset several times, use .repeat(). We can enumerate each batch by using either Python's enumerate or the built-in Dataset.enumerate() method. The latter produces a tensor index, which is recommended.
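A minimal sketch of both options (assuming TF 2.x eager execution and tensorflow imported as tf):

ds = tf.data.Dataset.range(3).repeat(2)   # yields 0, 1, 2, 0, 1, 2

for i, x in enumerate(ds):       # i is a plain Python int
    pass

for i, x in ds.enumerate():      # i is an int64 tensor
    pass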
Prefetching. Prefetching overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. Doing so reduces the step time to the maximum (as opposed to the sum) of the training time and the time it takes to extract the data.
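For example, a sketch of adding prefetching at the end of a pipeline (the preprocessing function is just a placeholder):

def preprocess(x):               # placeholder for real preprocessing
    return x * 2

ds = (tf.data.Dataset.range(100)
      .map(preprocess)
      .prefetch(1))              # prepare the next element while the current step runs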
from_tensor_slices() removes the first dimension and uses it as the dataset dimension.
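A quick sketch of what that means:

t = tf.zeros([4, 10])                        # a tensor of shape (4, 10)
ds = tf.data.Dataset.from_tensor_slices(t)   # a dataset of 4 elements, each of shape (10,)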
TL;DR: interleave() parallelizes the data-loading step by interleaving the I/O operations that read the files; map() applies the data preprocessing to the contents of the dataset.
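A sketch combining the two (the file pattern and parse function are placeholders; num_parallel_calls on interleave requires a reasonably recent TF version):

def parse_example(record):       # placeholder parser
    return record

files = tf.data.Dataset.list_files("data/*.tfrecord")
ds = (files
      .interleave(tf.data.TFRecordDataset,
                  cycle_length=4,
                  num_parallel_calls=4)      # interleaved, parallel file reads
      .map(parse_example,
           num_parallel_calls=4))            # parallel preprocessing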
EDIT 2: See tf.contrib.data.choose_from_datasets. It performs deterministic dataset interleaving.
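A minimal round-robin sketch with choose_from_datasets (the toy datasets below are purely illustrative; in TF 2.x the same functionality lives at tf.data.Dataset.choose_from_datasets / tf.data.experimental.choose_from_datasets):

datasets = [tf.data.Dataset.from_tensors(i).repeat() for i in range(3)]
choice = tf.data.Dataset.range(len(datasets)).repeat()   # 0, 1, 2, 0, 1, 2, ...
ds = tf.contrib.data.choose_from_datasets(datasets, choice)
# yields 0, 1, 2, 0, 1, 2, ...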
EDIT: See tf.contrib.data.sample_from_datasets. Even though it performs random sampling, I guess it can be useful.
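And a weighted-sampling sketch (again with toy datasets; the TF 2.x equivalent is tf.data.Dataset.sample_from_datasets):

datasets = [tf.data.Dataset.from_tensors(i).repeat() for i in range(3)]
ds = tf.contrib.data.sample_from_datasets(datasets, weights=[0.5, 0.25, 0.25])
# each element is drawn from dataset k with probability weights[k]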
Even though this is not "clean", it is the only workaround I came up with.
import tensorflow as tf

datasets = [tf.data.Dataset...]  # the already-constructed datasets to interleave

def concat_datasets(datasets):
    # Wrap each element in a single-element dataset and chain them together.
    ds0 = tf.data.Dataset.from_tensors(datasets[0])
    for ds1 in datasets[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(ds1))
    return ds0

# zip() pulls one element from every dataset at a time; flat_map() then emits
# those elements one after the other, producing a round-robin interleave.
ds = tf.data.Dataset.zip(tuple(datasets)).flat_map(
    lambda *args: concat_datasets(args)
)
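For example, with three small datasets (purely illustrative), the interleaved order looks like this; note that zip() stops at the length of the shortest dataset:

a = tf.data.Dataset.range(0, 3)      # 0, 1, 2
b = tf.data.Dataset.range(10, 13)    # 10, 11, 12
c = tf.data.Dataset.range(20, 23)    # 20, 21, 22
ds = tf.data.Dataset.zip((a, b, c)).flat_map(lambda *args: concat_datasets(args))
# yields 0, 10, 20, 1, 11, 21, 2, 12, 22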