I have a 3 TB dataset, 64 GB of RAM, a 12-core CPU, and one 12 GB GPU, and I would like to train a deep learning model on this dataset. How do I load batches asynchronously while the model trains? I want to make sure that loading data from disk never blocks the training loop while it waits for a new batch to arrive in memory.
I am not tied to any language, and the easiest library that can do this without friction wins, but I would prefer one of Torch, PyTorch, or TensorFlow.
There are two ways to split the load of a neural network across several machines: network/model parallelism and data parallelism. Today's neural networks consist of many layers. Each layer requires a set of computations that are usually represented as a graph.
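For instance, under model parallelism different parts of the layer graph live on different devices, and the forward pass hands activations across the boundary. A minimal PyTorch sketch; the two-device split, the layer sizes and the device names are illustrative assumptions, not part of the answer above:

import torch.nn as nn

class TwoDeviceNet(nn.Module):
    # First half of the layer graph lives on one device, the second
    # half on another; forward() moves activations between them.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(512, 256), nn.ReLU()).to('cuda:0')
        self.part2 = nn.Linear(256, 10).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        return self.part2(x.to('cuda:1'))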
We solved this problem in the way @mo-hossny described above (not "tied to the Imagenet folder structure") with Keras (TensorFlow backend) and described it in gory detail here.
A brief summary of that: most ML tutorials show a directory structure where the class of the training (and test) examples is implied by the subdirectory. For instance, you might see subdirectories and files like data/train/cats/???.png and data/train/dogs/???.png, etc.
If instead you create a simple Pandas DataFrame to hold the unique id, class label and file path for each train/test sample, then you can shuffle this DataFrame at the start of each epoch, loop over it in mini-batches and use a generator to send each chunk to the GPU. In the background, the CPU is keeping the queue of chunks full, standing by to send each subsequent one to the GPU as soon as it finishes its current batch.
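The shuffle step itself is a one-liner with pandas; a minimal sketch, assuming df is the DataFrame shown below:

df = df.sample(frac=1).reset_index(drop=True)  # reorder all rows at the start of each epoch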
An example of such a DataFrame is:
df
       object_id   bi  multi                                     path
index
0         461756  dog  white     /path/to/imgs/756/61/blah_461756.png
1        1161756  cat  black    /path/to/imgs/756/61/blah_1161756.png
2        3303651  dog  white    /path/to/imgs/651/03/blah_3303651.png
3        3367756  dog   grey    /path/to/imgs/756/67/blah_3367756.png
4        3767756  dog   grey    /path/to/imgs/756/67/blah_3767756.png
5        5467756  cat  black    /path/to/imgs/756/67/blah_5467756.png
6        5561756  dog  white    /path/to/imgs/756/61/blah_5561756.png
7       31255756  cat   grey   /path/to/imgs/756/55/blah_31255756.png
8       35903651  cat  black   /path/to/imgs/651/03/blah_35903651.png
9       44603651  dog  black   /path/to/imgs/651/03/blah_44603651.png
10      49557622  cat  black   /path/to/imgs/622/57/blah_49557622.png
11      58164756  dog   grey   /path/to/imgs/756/64/blah_58164756.png
12      95403651  cat  white   /path/to/imgs/651/03/blah_95403651.png
13      95555756  dog   grey   /path/to/imgs/756/55/blah_95555756.png
I've included labels for binomial and multinomial versions of the problem to demonstrate that the same DataFrame and files can be used in different classification settings.
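One way to switch between the two settings (my assumption, since the generator sketch below reads a column named target) is to copy whichever label column you need into a common target column:

df['target'] = df['bi']       # binomial labels: cat vs. dog
# df['target'] = df['multi']  # multinomial labels: white / black / grey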
Once you have this going, the Keras generator code is pretty short and sweet:
train_generator = generator_from_df(df, batch_size, target_size)
where df is similar to my example above and the function generator_from_df() is defined here. It simply loops through the df in chunks of a given size; reads, normalizes and concatenates the pixel data specified in the chunk's rows; and finally yields (hence the generator) the X (pixels) and Y (labels) data. The heart of it is very similar to:
import numpy as np
from keras.preprocessing.image import img_to_array, load_img

# df is assumed to have an 'imgpath' column (file path) and a
# 'target' column (label); nbatches = len(df) // batch_size.
i, j = 0, batch_size
for _ in range(nbatches):
    sub = df.iloc[i:j]
    # Read each image, rescale pixels from [0, 255] to [-1, 1],
    # and stack the results into a single batch array.
    X = np.array([
        (2 *
         (img_to_array(load_img(f, target_size=target_size))
          / 255.0 - 0.5))
        for f in sub.imgpath])
    Y = sub.target.values
    yield X, Y
    i = j
    j += batch_size
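To get the asynchronous loading the question asks about, hand this generator to Keras and let its built-in queue do the prefetching; a sketch, where model, nbatches and the parameter values are assumptions of mine rather than part of the original post:

model.fit_generator(train_generator,
                    steps_per_epoch=nbatches,
                    epochs=10,
                    max_queue_size=10,         # batches buffered ahead of the GPU
                    workers=4,                 # background workers keep the queue full
                    use_multiprocessing=False)

While the GPU consumes one batch, up to max_queue_size more are being prepared on the CPU, which is exactly the overlap described above.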
Note the references and code in the post: we aggregated helpful hints from others on the Keras pages and here on Stack Overflow.
If you don't want to be tied to the ImageNet folder structure, you can develop your own data loader in pretty much every framework. PyTorch sample code is available at https://stackoverflow.com/a/45102798/7387369. It loads the next batch while the model trains. Set num_workers to the number of worker processes to run in parallel.
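For a flavour of what that looks like, here is a minimal sketch of a DataFrame-backed PyTorch Dataset fed through a DataLoader; the DataFrameDataset class, the label mapping and the parameter values are my illustrative assumptions:

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader

LABELS = {'cat': 0, 'dog': 1}  # hypothetical label encoding

class DataFrameDataset(Dataset):
    # Serves (image tensor, label) pairs from a DataFrame like the one above.
    def __init__(self, df, target_size=(224, 224)):
        self.df = df
        self.target_size = target_size

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row.path).convert('RGB').resize(self.target_size)
        x = torch.from_numpy(np.asarray(img, dtype=np.float32) / 255.0)
        return x.permute(2, 0, 1), LABELS[row.bi]  # CHW tensor, integer label

loader = DataLoader(DataFrameDataset(df), batch_size=32, shuffle=True,
                    num_workers=12,   # worker processes prefetch batches in parallel
                    pin_memory=True)  # page-locked memory speeds up GPU transfers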