 

Most scalable way to use generators with tf.data? The tf.data guide says `from_generator` has limited scalability

tf.data has a `from_generator` constructor, but it doesn't seem to be scalable. From the official guide:

Caution: While this is a convenient approach it has limited portability and scalability. It must run in the same python process that created the generator, and is still subject to the Python GIL.

https://www.tensorflow.org/guide/data#consuming_python_generators

And in the official documentation

NOTE: The current implementation of Dataset.from_generator() uses tf.numpy_function and inherits the same constraints. In particular, it requires the Dataset- and Iterator-related operations to be placed on a device in the same process as the Python program that called Dataset.from_generator(). The body of generator will not be serialized in a GraphDef, and you should not use this method if you need to serialize your model and restore it in a different environment.

NOTE: If generator depends on mutable global variables or other external state, be aware that the runtime may invoke generator multiple times (in order to support repeating the Dataset) and at any time between the call to Dataset.from_generator() and the production of the first element from the generator. Mutating global variables or external state can cause undefined behavior, and we recommend that you explicitly cache any external state in generator before calling Dataset.from_generator().

https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator
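For reference, a minimal `from_generator` pipeline looks something like the sketch below. The generator, shapes, and dtypes are made up for illustration; `output_signature` requires TF 2.3+ (older versions use `output_types`/`output_shapes`):

```python
import tensorflow as tf

# Hypothetical generator yielding (feature_vector, label) pairs.
def my_generator():
    for i in range(100):
        yield [float(i)] * 128, i % 10

# output_signature declares the shape/dtype of each yielded component.
dataset = tf.data.Dataset.from_generator(
    my_generator,
    output_signature=(
        tf.TensorSpec(shape=(128,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)
```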

However, generators are a fairly common method for training over very large amounts of data, so there must be some alternative best practice for this, but the official TensorFlow data guide doesn't give any information on it.

Asked Dec 01 '19 by SantoshGupta7

People also ask

What does TF data dataset from_tensor_slices do?

from_tensor_slices() removes the first dimension of the input tensor and uses it as the dataset dimension, so each element of the dataset is one slice along that dimension.
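A quick sketch of that slicing behavior (the tensor values are arbitrary):

```python
import tensorflow as tf

# Shape (4, 2): the first dimension (4) becomes the dataset dimension.
t = tf.constant([[1, 2], [3, 4], [5, 6], [7, 8]])
ds = tf.data.Dataset.from_tensor_slices(t)
for element in ds:
    print(element.numpy())  # [1 2], then [3 4], [5 6], [7 8]
```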

What is the functionality of TF data Autotune?

Setting a tuning parameter to tf.data.AUTOTUNE prompts the tf.data runtime to tune the value dynamically at runtime. Note that the prefetch transformation provides benefits any time there is an opportunity to overlap the work of a "producer" with the work of a "consumer."
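A sketch of where AUTOTUNE typically goes; the map function is a placeholder, and before TF 2.4 the constant lives at tf.data.experimental.AUTOTUNE:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(1000)
# Placeholder transformation; num_parallel_calls is tuned by the runtime.
ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
# prefetch overlaps the "producer" (pipeline) with the "consumer" (training step).
ds = ds.prefetch(tf.data.AUTOTUNE)
```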

What is TF dataset?

TensorFlow Datasets is a collection of datasets ready to use with TensorFlow or other Python ML frameworks, such as Jax. All datasets are exposed as tf.data.Datasets, enabling easy-to-use and high-performance input pipelines. To get started see the guide and our list of datasets.
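For instance, loading one of those datasets is a single call; this assumes the separate tensorflow-datasets package is installed:

```python
import tensorflow_datasets as tfds

# Returns a tf.data.Dataset; downloads and caches the data on first use.
ds = tfds.load('mnist', split='train', shuffle_files=True)
```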

What is buffer size in TensorFlow?

For perfect shuffling, set the buffer size equal to the full size of the dataset. For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer.
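In code, with those illustrative sizes:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10_000)
# Only the first 1,000 elements are candidates for the first draw;
# a perfect shuffle here would need buffer_size=10_000.
ds = ds.shuffle(buffer_size=1_000)
```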


1 Answer

Iterate through your generator and write the data to a TFRecord file, then use TFRecordDataset. The official guide is here:

https://www.tensorflow.org/tutorials/load_data/tfrecord
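A minimal sketch of that conversion, assuming a hypothetical generator that yields (feature_vector, label) pairs; the names, shapes, and filename are made up:

```python
import tensorflow as tf

# Hypothetical generator producing (feature_vector, label) pairs.
def my_generator():
    for i in range(1000):
        yield [float(i), float(i) * 2.0], i % 10

def serialize(features, label):
    example = tf.train.Example(features=tf.train.Features(feature={
        'features': tf.train.Feature(float_list=tf.train.FloatList(value=features)),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

# One-time conversion: drain the generator into a TFRecord file on disk.
with tf.io.TFRecordWriter('data.tfrecord') as writer:
    for features, label in my_generator():
        writer.write(serialize(features, label))

# Training time: read the file back with no Python generator in the loop.
feature_spec = {
    'features': tf.io.FixedLenFeature([2], tf.float32),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

def parse(record):
    return tf.io.parse_single_example(record, feature_spec)

dataset = tf.data.TFRecordDataset('data.tfrecord').map(parse)
```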

TensorFlow is built to consume these kinds of datasets efficiently, including with multi-GPU training.

Sharding the data to disk also improves shuffling.
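For example, if the data were written as multiple shard files, reading them back could look like this sketch (the filename pattern is hypothetical):

```python
import tensorflow as tf

# list_files shuffles the shard order; interleave then mixes records from
# several shards, adding coarse-grained shuffling on top of the buffer.
filenames = tf.data.Dataset.list_files('data-*.tfrecord', shuffle=True)
dataset = filenames.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
)
dataset = dataset.shuffle(buffer_size=10_000)
```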

Answered Oct 08 '22 by Yaoshiang