 

How to asynchronously load batches and train a deep learning model?

I have a 3TB dataset, 64GB of RAM, a 12-core CPU and one 12GB GPU, and I would like to train a deep learning model on this dataset. How do I load batches and train the model asynchronously? I want to make sure that loading data from disk never blocks the training loop while it waits for the next batch to arrive in memory.

I am not tied to any particular language, and the easiest library that can do this without friction wins, but I would prefer one of Torch, PyTorch or TensorFlow.

asked May 12 '17 21:05 by Morteza Shahriari Nia



2 Answers

We solved this problem in the way @mo-hossny described above (not "tied to the Imagenet folder structure") with Keras (TensorFlow backend) and described it in gory detail here.

A brief summary of that: most ML tutorials show a directory structure where the class of training (and test) examples is implied by the subdirectory. For instance, you might see subdirectories and files like data/train/cats/???.png and data/train/dogs/???.png, etc.

If instead you create a simple Pandas DataFrame to hold the unique id, class label and file path for each train/test sample, then you can shuffle this DataFrame at the start of each epoch, loop over it in mini-batches and use a generator to send each chunk to the GPU. In the background, the CPU is keeping the queue of chunks full, standing by to send each subsequent one to the GPU as soon as it finishes its current batch.

An example of such a DataFrame is:

df

       object_id   bi  multi                                    path
index                                                               
 0        461756  dog  white    /path/to/imgs/756/61/blah_461756.png
 1       1161756  cat  black   /path/to/imgs/756/61/blah_1161756.png
 2       3303651  dog  white   /path/to/imgs/651/03/blah_3303651.png
 3       3367756  dog   grey   /path/to/imgs/756/67/blah_3367756.png
 4       3767756  dog   grey   /path/to/imgs/756/67/blah_3767756.png
 5       5467756  cat  black   /path/to/imgs/756/67/blah_5467756.png
 6       5561756  dog  white   /path/to/imgs/756/61/blah_5561756.png
 7      31255756  cat   grey  /path/to/imgs/756/55/blah_31255756.png
 8      35903651  cat  black  /path/to/imgs/651/03/blah_35903651.png
 9      44603651  dog  black  /path/to/imgs/651/03/blah_44603651.png
10      49557622  cat  black  /path/to/imgs/622/57/blah_49557622.png
11      58164756  dog   grey  /path/to/imgs/756/64/blah_58164756.png
12      95403651  cat  white  /path/to/imgs/651/03/blah_95403651.png
13      95555756  dog   grey  /path/to/imgs/756/55/blah_95555756.png

I've included labels for binomial and multinomial versions of the problem to demonstrate that the same DataFrame and files can be used in different classification settings.
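If it helps, here is one way such a DataFrame might be assembled; the build_dataframe helper, the glob pattern and the way labels get attached are illustrative assumptions, not part of the original post:

import glob
import os

import pandas as pd

def build_dataframe(img_root):
    # Walk the image tree and record one row per file (hypothetical helper).
    rows = []
    for path in glob.glob(os.path.join(img_root, "**", "*.png"), recursive=True):
        # Assume the trailing digits of the file name are the unique object id.
        object_id = int(os.path.splitext(os.path.basename(path))[0].split("_")[-1])
        rows.append({"object_id": object_id, "path": path})
    return pd.DataFrame(rows)

df = build_dataframe("/path/to/imgs")
# Label columns ("bi", "multi") would be merged in from wherever they live, e.g. a CSV.
# Shuffle at the start of each epoch so mini-batches see a fresh ordering.
df = df.sample(frac=1).reset_index(drop=True)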

Once you have this going, the Keras generator code is pretty short and sweet:

train_generator = generator_from_df(df, batch_size, target_size)

where df is similar to my example above and the function generator_from_df() is defined here. It simply loops through the df in chunks of a given size; reads, normalizes and concatenates the pixel data specified in the chunk's rows; and finally yields (hence the generator) the X (pixels) and Y (labels) data. The heart of it is very similar to:

# Sketch of the core loop inside generator_from_df(); assumes numpy is
# imported as np, Keras' load_img / img_to_array are available, and the
# DataFrame has `imgpath` and `target` columns.
nbatches = len(df) // batch_size
i, j = 0, batch_size
for _ in range(nbatches):
    sub = df.iloc[i:j]
    # Read each image, rescale pixels from [0, 255] to [-1, 1], stack into a batch.
    X = np.array([
        (2 *
         (img_to_array(load_img(f, target_size=target_size))
          / 255.0 - 0.5))
        for f in sub.imgpath])
    # Labels for the rows in this chunk.
    Y = sub.target.values
    yield X, Y
    i = j
    j += batch_size
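
To get the asynchronous behaviour the question asks about, the generator is handed to Keras, which keeps a queue of prepared batches filled by background workers while the GPU trains on the current one. A minimal usage sketch (the model, epoch count and worker settings below are illustrative assumptions, not part of the original post):

# `model` is any compiled Keras model; the arguments below are illustrative.
steps_per_epoch = len(df) // batch_size

model.fit_generator(
    generator_from_df(df, batch_size, target_size),
    steps_per_epoch=steps_per_epoch,
    epochs=10,
    max_queue_size=10,  # how many prepared batches to keep buffered ahead of the GPU
    workers=4)          # background workers filling the queue while the GPU trains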

Note the references and code in the post: we aggregated helpful hints from others in the Keras pages and here on Stack Overflow.

answered Sep 22 '22 00:09 by timehaven


If you don't want to be tied to the ImageNet folder structure, you can develop your own data loader in pretty much any framework. A PyTorch code sample is available at https://stackoverflow.com/a/45102798/7387369. It loads the next batch while training. Set num_workers to the number of worker processes that load batches in parallel.
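
For illustration only (this is not the code from the linked answer), a bare-bones version of that idea might look like the following; the DataFrameImageDataset class, the labels.csv file and the label mapping are assumptions:

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class DataFrameImageDataset(Dataset):
    # Hypothetical dataset: one row per image, with "path" and "bi" label columns.
    def __init__(self, df, target_size=(224, 224)):
        self.df = df.reset_index(drop=True)
        self.tf = transforms.Compose([transforms.Resize(target_size),
                                      transforms.ToTensor()])

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = self.tf(Image.open(row["path"]).convert("RGB"))
        label = 0 if row["bi"] == "dog" else 1  # assumed binary label mapping
        return img, label

df = pd.read_csv("labels.csv")  # hypothetical file with "path" and "bi" columns

# num_workers > 0 spawns worker processes that prefetch the next batches
# while the GPU is busy with the current one.
loader = DataLoader(DataFrameImageDataset(df), batch_size=32, shuffle=True,
                    num_workers=4, pin_memory=True)

for X, Y in loader:
    X, Y = X.cuda(non_blocking=True), Y.cuda()  # training step for this batch goes here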

answered Sep 23 '22 00:09 by Mo Hossny