Keras: batch training for multiple large datasets

This question concerns the common problem of training a Keras model on multiple large files that are jointly too large to fit in GPU memory. I am using Keras 1.0.5 and I would like a solution that does not require 1.0.6. One way to do this was described by fchollet here and here:

import pickle

# Create a generator that yields (current features X, current labels y)
# for one pickled file at a time.
def BatchGenerator(files):
    for file in files:
        current_data = pickle.load(open(file, "rb"))
        X_train = current_data[:, :-1]
        y_train = current_data[:, -1]
        yield (X_train, y_train)

# Train the model on each dataset in turn.
for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size=32, nb_epoch=1)

However, I fear that the model's state is not preserved, and that the model is instead reinitialized not only between epochs but also between datasets. Each "Epoch 1/1" in the output below corresponds to training on a different dataset:

~~~~~ Epoch 0 ~~~~~~

Epoch 1/1 295806/295806 [==============================] - 13s - loss: 15.7517
Epoch 1/1 407890/407890 [==============================] - 19s - loss: 15.8036
Epoch 1/1 383188/383188 [==============================] - 19s - loss: 15.8130
~~~~~ Epoch 1 ~~~~~~

Epoch 1/1 295806/295806 [==============================] - 14s - loss: 15.7517
Epoch 1/1 407890/407890 [==============================] - 20s - loss: 15.8036
Epoch 1/1 383188/383188 [==============================] - 15s - loss: 15.8130
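
One way to test whether the weights actually carry over between successive model.fit calls is to track the loss on a small, fixed probe batch: if the model were reinitialized on every call, that loss would jump back up each time instead of trending downward. A minimal sketch, reusing BatchGenerator, files and n_epochs from above (the probe size of 1000 is arbitrary):

# Hold out one small, fixed probe batch to monitor across fit calls.
X_probe, y_probe = next(BatchGenerator(files))
X_probe, y_probe = X_probe[:1000], y_probe[:1000]

for epoch in range(n_epochs):
    for (X_train, y_train) in BatchGenerator(files):
        model.fit(X_train, y_train, batch_size=32, nb_epoch=1, verbose=0)
        # If fit() reset the weights, this value would not keep improving.
        print("probe loss:", model.evaluate(X_probe, y_probe, verbose=0))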

I am aware that one can use model.fit_generator, but since the method above has been repeatedly suggested as a way of batch training, I would like to know what I am doing wrong.

Thanks for your help,

Max


1 Answer

It has been a while since I faced that problem, but I remember that I used Keras' ability to consume data through Python generators, i.e. model = Sequential(); model.fit_generator(...).

An exemplary code snippet (which should be self-explanatory):

import pickle
import numpy as np

def generate_batches(files, batch_size):
    """Cycle over the pickled bundles forever, yielding fixed-size batches."""
    counter = 0
    while True:
        fname = files[counter]
        print(fname)
        counter = (counter + 1) % len(files)
        data_bundle = pickle.load(open(fname, "rb"))
        X_train = data_bundle[0].astype(np.float32)
        y_train = data_bundle[1].astype(np.float32)
        y_train = y_train.flatten()
        for cbatch in range(0, X_train.shape[0], batch_size):
            yield (X_train[cbatch:(cbatch + batch_size), :, :],
                   y_train[cbatch:(cbatch + batch_size)])

model = Sequential()
# ... add layers here ...
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# train_bundle_loc, nb_train_bundles, batch_size, samples_per_epoch,
# num_epoch and class_weights are defined elsewhere.
train_files = [train_bundle_loc + "bundle_" + str(cb) for cb in range(nb_train_bundles)]
gen = generate_batches(files=train_files, batch_size=batch_size)
history = model.fit_generator(gen, samples_per_epoch=samples_per_epoch,
                              nb_epoch=num_epoch, verbose=1,
                              class_weight=class_weights)
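
With this setup, samples_per_epoch would usually be set to the total number of samples across all bundles, so that one Keras epoch corresponds to roughly one pass over every file. A rough sketch for computing it, assuming the bundles have the same (X, y) pickle format as in the generator above:

import pickle

# Count the total number of samples across all training bundles.
samples_per_epoch = 0
for fname in train_files:
    with open(fname, "rb") as f:
        data_bundle = pickle.load(f)
    samples_per_epoch += data_bundle[0].shape[0]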