Parallel threads with TensorFlow Dataset API and flat_map

Tags: python, tensorflow

I'm migrating my TensorFlow code from the old queue interface to the new Dataset API. With the old interface I could specify the num_threads argument to the tf.train.shuffle_batch queue. In the Dataset API, however, the only way to control the number of threads seems to be the num_parallel_calls argument of the map function. I'm using the flat_map function instead, which has no such argument.

Question: Is there a way to control the number of threads/processes for the flat_map function? Or is there a way to use map in combination with flat_map and still specify the number of parallel calls?

Note that it is of crucial importance to run multiple threads in parallel, as I intend to run heavy pre-processing on the CPU before data enters the queue.

There are two (here and here) related posts on GitHub, but I don't think they answer this question.

Here is a minimal code example of my use-case for illustration:

import tensorflow as tf

with tf.Graph().as_default():
    data = tf.ones(shape=(10, 512), dtype=tf.float32, name="data")
    input_tensors = (data,)

    def pre_processing_func(data_):
        # normally I would do data-augmentation here
        results = (tf.expand_dims(data_, axis=0),)
        return tf.data.Dataset.from_tensor_slices(results)

    dataset_source = tf.data.Dataset.from_tensor_slices(input_tensors)
    dataset = dataset_source.flat_map(pre_processing_func)
    # do something with 'dataset'


asked Nov 21 '17 10:11

CNugteren

1 Answer

To the best of my knowledge, at the moment flat_map does not offer parallelism options. Given that the bulk of the computation is done in pre_processing_func, what you might use as a workaround is a parallel map call followed by some buffering, and then using a flat_map call with an identity lambda function that takes care of flattening the output.

In code:

NUM_THREADS = 5
BUFFER_SIZE = 1000

def pre_processing_func(data_):
    # data-augmentation here
    # generate new samples starting from the sample `data_`
    artificial_samples = generate_from_sample(data_)
    return artificial_samples

dataset_source = (tf.data.Dataset.from_tensor_slices(input_tensors).
                  map(pre_processing_func, num_parallel_calls=NUM_THREADS).
                  prefetch(BUFFER_SIZE).
                  flat_map(lambda *x : tf.data.Dataset.from_tensor_slices(x)).
                  shuffle(BUFFER_SIZE)) # my addition, probably necessary though

Note (to myself and whoever will try to understand the pipeline):

pre_processing_func generates an arbitrary number of new samples from each initial sample, organised in matrices of shape (?, 512). The flat_map call is therefore necessary: the lambda turns each generated matrix into a Dataset of single samples (hence the tf.data.Dataset.from_tensor_slices(x)), and flat_map then flattens all these datasets into one big Dataset containing individual samples.

It's probably a good idea to .shuffle() that dataset; otherwise samples generated from the same input will be packed together.
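
To make the data flow of that pipeline easier to follow, here is a small plain-Python analogy. It is not TensorFlow code: it only mirrors the three stages (parallel map, flatten, shuffle) using the standard library. The pre_processing_func body, the build_pipeline helper and the integer "samples" are made up for illustration, and the sketch says nothing about actual TensorFlow performance.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain
import random

NUM_THREADS = 5


def pre_processing_func(sample):
    # Hypothetical stand-in for data augmentation: each input sample
    # yields a variable number of new samples (here: sample + 1 of them).
    return [(sample, k) for k in range(sample + 1)]


def build_pipeline(samples, seed=0):
    # Stage 1, the parallel "map": run the pre-processing in a thread pool.
    # executor.map returns the results in input order.
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        groups = list(executor.map(pre_processing_func, samples))
    # Stage 2, the identity "flat_map": flatten the groups into single samples.
    flat = list(chain.from_iterable(groups))
    # Stage 3, the shuffle: samples from the same input are no longer adjacent.
    random.Random(seed).shuffle(flat)
    return groups, flat


groups, flat = build_pipeline([0, 1, 2, 3])
print([len(g) for g in groups])  # [1, 2, 3, 4]
print(len(flat))  # 10
```

The expensive step sits entirely in stage 1, which is the only stage that runs in parallel; stage 2 only reshapes what stage 1 already produced, which is why the missing parallelism option on flat_map does not hurt in this arrangement.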

answered Sep 21 '22 03:09

GPhilo
