I am using TensorFlow to train on a very large dataset that is too large to fit in RAM. I have therefore split the dataset into a number of shards on the hard drive, and I am using the tf.data.Dataset class to load the shard data into a tf.placeholder in GPU memory. To train across these shards, I am considering two approaches, but I do not know which one would be best practice. They are:
1) For each epoch, load each dataset shard sequentially, and train one iteration on each shard.
2) For each epoch, load each dataset shard sequentially, and then train multiple times on each shard.
The problem with 1) is that it takes a long time to load each shard from the hard drive, and since each shard is only trained on for a single iteration, a large proportion of the overall training time is spent loading data. The problem with 2) is that training on the same shard several times in a row makes the optimisation more likely to converge to a local minimum.
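Roughly, the two schedules look like this (load_shard and train_step are just stand-ins for my actual loading and training code):

```python
def load_shard(path):
    # Placeholder: in practice this builds a tf.data.Dataset for one shard.
    return ["%s-batch-%d" % (path, i) for i in range(3)]

def train_step(batch):
    pass  # placeholder for one optimisation step

shard_paths = ["shard_000", "shard_001"]

# Option 1: one pass over each shard per epoch.
for epoch in range(10):
    for path in shard_paths:
        for batch in load_shard(path):   # disk read dominates here
            train_step(batch)

# Option 2: several passes over each shard before moving on.
for epoch in range(10):
    for path in shard_paths:
        shard = load_shard(path)
        for _ in range(4):               # amortises the load cost
            for batch in shard:
                train_step(batch)
```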
Which is the recommended approach?
Edit: answer updated with new links.
The Dataset class is definitely designed for the use-case of data which is too large to fit in RAM. The tf.data performance guide is worth reading.
I would start by seeing whether strategic use of prefetch after your data-reading code, plus prefetch_to_device at the very end of your dataset pipeline, can help hide the latency of the "Extract" stage of the ETL process.
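For example, something along these lines (the filename pattern, feature spec, batch size, and GPU device string are placeholders for whatever your pipeline actually uses):

```python
import tensorflow as tf

def parse_example(serialized):
    # Stand-in for your real parsing / augmentation ("Transform" stage);
    # the feature spec here is purely illustrative.
    return tf.io.parse_single_example(
        serialized,
        {"feature": tf.io.FixedLenFeature([10], tf.float32),
         "label": tf.io.FixedLenFeature([], tf.int64)})

files = tf.data.Dataset.list_files("data/shard_*.tfrecord")
dataset = tf.data.TFRecordDataset(files)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)  # overlap the "Extract" stage with the rest
dataset = dataset.map(parse_example,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(32)
dataset = dataset.apply(
    tf.data.experimental.prefetch_to_device("/gpu:0"))     # stage batches directly on the GPU
```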
I would also recommend shuffling the order in which the files are loaded, and using the Dataset shuffle ops as well, to avoid the local minima you describe; ideally the examples should be in random order within the files to begin with. If you are currently using Python code to load your data, it may be worth preprocessing it into e.g. the TFRecord format, so that you can benefit from the native performance of TFRecordDataset.
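A minimal sketch of that shuffling, assuming your shards are already TFRecord files (the filename pattern and buffer sizes are placeholders):

```python
import tensorflow as tf

# Shuffle at the file level so shards are visited in a different order each
# epoch, and at the example level so records within a shard are not always
# seen in the same order.
files = tf.data.Dataset.list_files("data/shard_*.tfrecord", shuffle=True)
dataset = files.interleave(tf.data.TFRecordDataset,
                           cycle_length=4)    # read several shards at once, mixing their records
dataset = dataset.shuffle(buffer_size=10000)  # example-level shuffle, sized to your RAM budget
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
```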
Additional info that would be useful: