 

Training on datasets too big to fit in RAM

Tags:

tensorflow

I am using TensorFlow to train on a very large dataset, which is too large to fit in RAM. I have therefore split the dataset into a number of shards on the hard drive, and I am using the tf.data.Dataset class to load the shard data into a tf.placeholder in GPU memory. To train across these shards, I am considering two approaches, but I do not know which one is best practice. They are:

1) For each epoch, load each dataset shard sequentially, and train one iteration on each shard.

2) For each epoch, load each dataset shard sequentially, and then train multiple times on each shard.

The problem with 1) is that it takes a long time to load each shard from the hard drive, and since each shard is only trained on for a single iteration, a large proportion of the overall training time is spent loading data. However, the problem with 2) is that training on the same shard multiple times in a row makes the optimisation more likely to converge to a local minimum.
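For reference, here is a minimal sketch of the loading pattern I currently have (the shapes, file names, and the load_shard helper are placeholders for my real code):

    import numpy as np
    import tensorflow as tf

    NUM_FEATURES = 784  # placeholder: flat feature vectors
    SHARD_PATHS = ['shards/shard-%d.npy' % i for i in range(10)]  # placeholder

    def load_shard(path):
        # Placeholder loader: reads one shard of (features, labels) from disk.
        data = np.load(path, allow_pickle=True).item()
        return data['features'], data['labels']

    features_ph = tf.placeholder(tf.float32, shape=[None, NUM_FEATURES])
    labels_ph = tf.placeholder(tf.int64, shape=[None])

    dataset = tf.data.Dataset.from_tensor_slices((features_ph, labels_ph)).batch(128)
    iterator = dataset.make_initializable_iterator()
    next_batch = iterator.get_next()

    with tf.Session() as sess:
        for shard_path in SHARD_PATHS:
            features, labels = load_shard(shard_path)
            sess.run(iterator.initializer,
                     feed_dict={features_ph: features, labels_ph: labels})
            while True:  # one pass over the shard (option 1); repeat for option 2
                try:
                    sess.run(next_batch)  # the training step would consume this
                except tf.errors.OutOfRangeError:
                    break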

Which is the recommended approach?

Karnivaurus asked Nov 08 '22


1 Answer

edit: Answer updated with new links.

The Dataset class is definitely designed for the use case of data that is too large to fit in RAM. The tf.data performance guide is worth reading.

I would start by seeing whether strategic use of prefetch after your data-reading code, plus prefetch-to-device at the end of your dataset pipeline, can help hide the latency of the "Extract" stage of the ETL process.
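Something along these lines (a minimal sketch with a stand-in dataset; note that the exact location of prefetch_to_device has moved between TF releases, e.g. tf.contrib.data in older versions):

    import tensorflow as tf

    # Stand-in for your real shard-reading pipeline.
    dataset = tf.data.Dataset.from_tensor_slices(tf.random_uniform([1000, 784]))
    dataset = dataset.batch(128)

    # Overlap extraction with training: while the model consumes the current
    # batch, the pipeline reads and prepares the next one in the background.
    dataset = dataset.prefetch(buffer_size=1)

    # Stage prefetched batches directly into GPU memory; this must be the
    # final transformation in the pipeline.
    dataset = dataset.apply(tf.data.experimental.prefetch_to_device('/gpu:0'))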

I would also recommend shuffling the order in which the files are loaded, and also using the Dataset shuffle ops, to avoid the local minima you describe; ideally the examples should be in random order within the shards to begin with as well. If you are currently using Python code to load your data, it may be worth preprocessing it into e.g. TFRecord format so that you can benefit from the native performance of TFRecordDataset (see the sketch below).
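A shuffled TFRecord-reading pipeline could look roughly like this (the file pattern and feature spec are assumptions; adapt them to your data):

    import tensorflow as tf

    def parse_example(serialized):
        # Hypothetical feature spec; replace with your own schema.
        features = tf.parse_single_example(serialized, {
            'image': tf.FixedLenFeature([784], tf.float32),
            'label': tf.FixedLenFeature([], tf.int64),
        })
        return features['image'], features['label']

    # Shuffle the order in which the shard files are visited each epoch.
    files = tf.data.Dataset.list_files('data/shard-*.tfrecord', shuffle=True)

    # Read several shards concurrently so consecutive examples come from
    # different shards rather than one shard at a time.
    dataset = files.interleave(tf.data.TFRecordDataset,
                               cycle_length=4, block_length=16)

    # Example-level shuffling on top of the file-level shuffling.
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.map(parse_example, num_parallel_calls=4)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(buffer_size=1)

And the one-off conversion to TFRecords might look like this (reusing the hypothetical load_shard and SHARD_PATHS from the question's sketch):

    import tensorflow as tf

    for i, shard_path in enumerate(SHARD_PATHS):
        images, labels = load_shard(shard_path)  # your existing Python loader
        with tf.python_io.TFRecordWriter('data/shard-%04d.tfrecord' % i) as writer:
            for image, label in zip(images, labels):
                example = tf.train.Example(features=tf.train.Features(feature={
                    'image': tf.train.Feature(
                        float_list=tf.train.FloatList(value=image)),
                    'label': tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[label])),
                }))
                writer.write(example.SerializeToString())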

Additional info that would be useful:

  1. Are you training on a single machine or a cluster?
  2. What is the data format and how are you currently loading it?
Ed Bordin answered Nov 15 '22