Keras fit_generator with pandas iterator object

I've got a csv too big to read into memory at once so I want to chunk it out and fit a keras model with it piece by piece. I think I'm misunderstanding how the fit_generator function works though since I keep getting StopIteration errors even though the chunksize & steps_per_epoch correctly account for how many rows are in my csv.

Code:

import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout

np.random.seed(26)
x_train_generator = pd.read_csv('X_train.csv', header=None, chunksize=150000)
y_train_generator = pd.read_csv('Y_train.csv', header=None, chunksize=150000)
x_test_generator = pd.read_csv('X_test.csv', header=None, chunksize=50000)
y_test_generator = pd.read_csv('Y_test.csv', header=None, chunksize=50000)

model = Sequential()
model.add(Dense(500, input_dim=1132, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', metrics=['accuracy'],
              optimizer='adam')

model.fit_generator((x_train_generator.get_chunk().as_matrix(),
                     y_train_generator.get_chunk().as_matrix()),
          steps_per_epoch=37,
          epochs=1,
          verbose=2,
          validation_data=(x_test_generator.get_chunk().as_matrix(),
                           y_test_generator.get_chunk().as_matrix()),
          validation_steps=37
            )

Error output:

Exception in thread Thread-107:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/user/myenv/local/lib/python2.7/site-packages/keras/utils/data_utils.py", line 568, in data_generator_task
    generator_output = next(self._generator)
TypeError: tuple object is not an iterator

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
/home/user/tmp_keras.py in <module>()
     22           verbose=2,
     23           validation_data=(x_test_generator.get_chunk().as_matrix(), y_test_generator.get_chunk().as_matrix()),
---> 24           validation_steps=37
     25                 )
     26

/home/user/myenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.pyc in wrapper(*args, **kwargs)
     85                 warnings.warn('Update your `' + object_name +
     86                               '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87             return func(*args, **kwargs)
     88         wrapper._original_function = func
     89         return wrapper

/home/user/myenv/local/lib/python2.7/site-packages/keras/models.pyc in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, initial_epoch)
   1119                                         workers=workers,
   1120                                         use_multiprocessing=use_multiprocessing,
-> 1121                                         initial_epoch=initial_epoch)
   1122
   1123     @interfaces.legacy_generator_methods_support

/home/user/myenv/local/lib/python2.7/site-packages/keras/legacy/interfaces.pyc in wrapper(*args, **kwargs)
     85                 warnings.warn('Update your `' + object_name +
     86                               '` call to the Keras 2 API: ' + signature, stacklevel=2)
---> 87             return func(*args, **kwargs)
     88         wrapper._original_function = func
     89         return wrapper

/home/user/myenv/local/lib/python2.7/site-packages/keras/engine/training.pyc in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
   2009                 batch_index = 0
   2010                 while steps_done < steps_per_epoch:
-> 2011                     generator_output = next(output_generator)
   2012
   2013                     if not hasattr(generator_output, '__len__'):

StopIteration:

Weirdly, if I wrap the fit_generator() in a while 1: try: ... except StopIteration: it manages to run.

I've tried using x/y_train_generator in the fit_generator arguments without the get_chunk().as_matrix() functions but it fails since I'm not passing keras a numpy array.

user3555455 asked Oct 09 '17 02:10


1 Answer

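As mentioned in the comments, your issue is with how the pandas reader is being used. pd.read_csv(..., chunksize=...) returns an iterator over chunks, and each .get_chunk() call pulls a single DataFrame out of it. Your fit_generator call therefore receives a plain tuple of two arrays rather than a generator, which is what the "TypeError: tuple object is not an iterator" in your traceback is reporting. What you want is to go through the chunks one at a time, turning each DataFrame into an array with .as_matrix() and training on it.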
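For reference, fit_generator expects a generator that yields (inputs, targets) tuples and loops over its data indefinitely. A minimal sketch of such a wrapper follows, assuming Python 3 and chunks that expose .values the way pandas DataFrames do. The helper names chunk_pairs_forever and steps_for are hypothetical and not part of Keras or pandas; the file names and sizes in the usage comment are copied from the question.

```python
import math


def chunk_pairs_forever(make_x_reader, make_y_reader):
    """Yield (inputs, targets) tuples indefinitely.

    make_x_reader / make_y_reader are zero-argument callables that return a
    fresh iterable of chunks (hypothetical names, chosen for this sketch).
    """
    while True:
        # Reopen both readers so every pass starts again at the first chunk.
        x_reader, y_reader = make_x_reader(), make_y_reader()
        # zip is lazy on Python 3; on Python 2, itertools.izip would be
        # needed to avoid reading every chunk into memory at once.
        for x_chunk, y_chunk in zip(x_reader, y_reader):
            # .values is the NumPy array underlying a pandas DataFrame.
            yield x_chunk.values, y_chunk.values


def steps_for(n_rows, chunksize):
    """Number of chunks needed to cover n_rows (last chunk may be short)."""
    return int(math.ceil(n_rows / float(chunksize)))


# Hypothetical usage with the question's files (needs pandas and Keras):
# import pandas as pd
# train_gen = chunk_pairs_forever(
#     lambda: pd.read_csv('X_train.csv', header=None, chunksize=150000),
#     lambda: pd.read_csv('Y_train.csv', header=None, chunksize=150000))
# model.fit_generator(train_gen, steps_per_epoch=37, epochs=1, verbose=2)
```

Because the wrapper reopens the files after each pass, it should not run dry if Keras asks for more batches than one pass over the csv provides, which is one plausible way to hit a StopIteration.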

To restructure your code, you'll need a loop, and you'll need to update your model inside the loop. I have two suggestions for you:

  1. (Easiest) Restructure the program above: loop over the chunks that Pandas yields, each of which is a DataFrame, and call .as_matrix() on each one inside the loop. That way, you are actually getting concrete data for your X_train, y_train, X_test, y_test, instead of an IO iterator. You can then update your trained model using the new chunk of data. (If you already have a trained model, and you call .fit() again, it will continue training the existing model rather than start over.)

  2. (Using Keras functionality instead of Pandas functionality) Use the built-in Keras utility for reading large data sets, HDF5Matrix (see the Keras documentation), to read data from an HDF5 file in chunks. That data will be transparently treated as a Numpy array. Something like this:

    from keras.utils.io_utils import HDF5Matrix

    def load_data(path_to_data, start_ix, n_samples):
        """
        This works for loading testing or training data.
        This assumes input data have been named "inputs",
        output data have been named "outputs" in HDF5 file,
        and that you are grabbing n_samples from the file.
        """
        # Each HDF5Matrix covers rows start_ix up to (not including) start_ix + n_samples.
        X = HDF5Matrix(path_to_data, 'inputs', start_ix, start_ix + n_samples)
        y = HDF5Matrix(path_to_data, 'outputs', start_ix, start_ix + n_samples)
        return (X, y)

    # The paths, start indices and sample counts below are placeholders to fill in.
    X_train, y_train = load_data(path_to_training_h5, train_start_ix, n_training_samples)
    X_test,  y_test  = load_data(path_to_testing_h5, testing_start_ix, n_testing_samples)

Like solution #1, this would be structured within an overarching for loop that updates start_ix and n_samples within each iteration, in addition to updating (re-fitting) the model within each iteration. For another illustration of how to use HDF5Matrix, see this example from GitHub user @jfsantos.
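The loop described in solution #1 can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested training script: fit_in_chunks is a hypothetical helper name, the model is any object with a Keras-style .fit() method, the chunks are assumed to expose .values like pandas DataFrames (used here in place of .as_matrix(), which newer pandas releases no longer provide), and Python 3 is assumed so that zip stays lazy.

```python
def fit_in_chunks(model, x_chunks, y_chunks, batch_size=32):
    """Update `model` once per (x, y) chunk pair and return the chunk count.

    x_chunks / y_chunks are iterables of DataFrame-like chunks, for example
    the readers returned by pd.read_csv(..., chunksize=...).
    """
    n_chunks = 0
    for x_chunk, y_chunk in zip(x_chunks, y_chunks):
        # Calling fit() again continues training from the current weights,
        # so each chunk refines the same model.
        model.fit(x_chunk.values, y_chunk.values,
                  batch_size=batch_size, epochs=1, verbose=2)
        n_chunks += 1
    return n_chunks


# Hypothetical usage with the question's files (needs pandas and Keras):
# import pandas as pd
# fit_in_chunks(
#     model,
#     pd.read_csv('X_train.csv', header=None, chunksize=150000),
#     pd.read_csv('Y_train.csv', header=None, chunksize=150000))
```

For several passes over the data, the readers would have to be recreated before each call, since a chunked reader is exhausted after one pass.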

charlesreid1 answered Oct 08 '22 14:10