Memory management in Tensorflow's Dataset API

I have a training dataset that is too big to fit into memory, so my code reads only 1,000 records from disk at a time. Now I would like to use Tensorflow's new Dataset API. Does the Dataset API allow me to specify the number of records to keep in memory or does Tensorflow automatically manage memory so that I don't have to?

asked Jul 16 '17 by user554481

People also ask

Does TF data use GPU?

TensorFlow code and tf.keras models will transparently run on a single GPU with no code changes required. Use tf.config.list_physical_devices('GPU') to confirm that TensorFlow is using the GPU.
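
A minimal check (assuming TensorFlow 2.x) looks like this:

import tensorflow as tf

# List the GPUs TensorFlow can see; tf.keras models will place their ops on
# the first visible GPU automatically when one is available.
gpus = tf.config.list_physical_devices('GPU')
print("Num GPUs available:", len(gpus))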

What does TF data dataset from_tensor_slices do?

With the tf.data.Dataset.from_tensor_slices() method, we can get the slices of an array (or a tuple of arrays) as the elements of a dataset.
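
A small sketch of that in practice (assuming TensorFlow 2.x eager execution; the arrays are made up for illustration):

import tensorflow as tf

features = tf.constant([[1, 2], [3, 4], [5, 6]])
labels = tf.constant([0, 1, 0])

# Each dataset element is one (feature row, label) pair sliced along axis 0.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
for x, y in dataset:
    print(x.numpy(), y.numpy())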

How do I iterate over a TensorFlow dataset?

To iterate over the dataset several times, use .repeat(). We can enumerate each batch with either Python's enumerate() or the built-in Dataset.enumerate() method; the latter produces the index as a tensor.
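
A hedged sketch of both styles (TensorFlow 2.x assumed):

import tensorflow as tf

dataset = tf.data.Dataset.range(6).repeat(2).batch(4)

# Python's enumerate yields a plain int step counter.
for step, batch in enumerate(dataset):
    print(step, batch.numpy())

# Dataset.enumerate() yields the step index as a tensor instead.
for step, batch in dataset.enumerate():
    print(step.numpy(), batch.numpy())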

How can TensorFlow be used to configure the dataset for performance?

A dataset can be configured for performance using the AUTOTUNE attribute in the tf.data module. Buffered prefetching is used to ensure that data can be read from disk without I/O becoming a bottleneck.
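
A minimal sketch of that pattern (assuming TensorFlow 2.x; the file path is hypothetical):

import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older 2.x releases

dataset = tf.data.TFRecordDataset(["/var/data/file1.tfrecord"])
dataset = dataset.cache()                         # keep records in memory after the first pass
dataset = dataset.prefetch(buffer_size=AUTOTUNE)  # overlap disk reads with training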


2 Answers

Yes. Here is an example from the official guide (Using the Dataset API for TensorFlow Input Pipelines, https://www.tensorflow.org/programmers_guide/datasets):

filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
dataset = tf.contrib.data.TFRecordDataset(filenames)
dataset = dataset.map(...)                    # parse each record with a user-specified function
dataset = dataset.shuffle(buffer_size=10000)  # 10000: size of the in-memory pool records are sampled from
dataset = dataset.repeat()                    # no argument: repeat indefinitely
dataset = dataset.batch(32)                   # 32: number of records read into memory per batch
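
To make the memory behaviour concrete, here is a hedged sketch of consuming that pipeline with the 1.x-era iterator API this answer was written against (in TensorFlow 2.x you would simply write `for batch in dataset:`):

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    for step in range(1000):          # repeat() never raises OutOfRangeError, so bound the loop
        batch = sess.run(next_batch)  # each step pulls one 32-record batch; only the shuffle
                                      # buffer (10000 records) plus this batch are resident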
answered Oct 04 '22 by Maosi Chen


You specify the number of records via batch_size; in that case TF will grab only batch_size elements from the file at a time. You can also specify shuffle with a buffer_size, which guarantees that at most buffer_size elements are kept in memory at any time.
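
A short sketch of those two knobs together (assuming the tf.data API; the file names are hypothetical):

import tensorflow as tf

dataset = tf.data.TFRecordDataset(["/var/data/file1.tfrecord",
                                   "/var/data/file2.tfrecord"])
dataset = dataset.shuffle(buffer_size=1000)  # at most ~1000 records held in the shuffle buffer
dataset = dataset.batch(100)                 # 100 records materialized per training step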

I verified this on my tfrecord files. I have 100 tfrecord files, each of them ~10 GB (which is more than the memory on my laptop), and everything works fine.
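
For reference, a hedged sketch of that setup (the shard paths are made up): passing all 100 file names to TFRecordDataset streams records from disk, so the total file size never has to fit in memory:

import tensorflow as tf

filenames = ["/data/shard-%03d.tfrecord" % i for i in range(100)]  # hypothetical paths
dataset = (tf.data.TFRecordDataset(filenames)   # records are streamed, not loaded whole
           .shuffle(buffer_size=10000)
           .batch(32))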

answered Oct 04 '22 by Salvador Dali