
Transform Pandas Column into Numpy Array for Keras Neural Network

I am learning how to create CNN models and found an interesting competition on Kaggle that I thought would help me learn.

They provided a large JSON-like (BSON) file, around 50GB, that I am trying to process. I am trying to train a convolutional neural network using the Keras module. In the file I iteratively read the image data, where each image has the array shape (180, 180, 3). The whole file contains around 7,000,000 images, so the final array would have the shape (7000000, 180, 180, 3). However, I cannot read all of this data into memory, so what I am aiming to do is read in only 100,000 images at a time to fit the neural network, save the model's weights, delete the array to free up memory, and then read the next 100,000 images into a new array to re-fit the previously trained model. I would do this iteratively until I reach the last image.
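In outline, the plan would look roughly like this (a sketch only; load_chunk() and build_model() are placeholders for code I still need to write):

model = build_model()  # hypothetical helper: builds and compiles the Keras CNN

# Process the 7,000,000 images in chunks of 100,000
for start in range(0, 7000000, 100000):
    # hypothetical helper that reads one slice of train.bson into memory
    X_chunk, y_chunk = load_chunk('train.bson', start, start + 100000)
    model.fit(X_chunk, y_chunk, epochs=1, batch_size=32)
    model.save_weights('cnn_weights.h5')
    del X_chunk, y_chunk  # free the memory before reading the next chunk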

I initially tried to use np.append() to append each image array iteratively, but this took far too long: in 10 hours I only got through 25,000 images (an array of shape (25000, 180, 180, 3)), and the appends became very slow near the end as the array grew.

I then tried a different approach using a pandas DataFrame, appending each (1, 180, 180, 3) array into a cell of a single column. Using this method I was able to iterate through 100,000 images in around 20 minutes (most of the code comes from the Kaggle kernel at https://www.kaggle.com/inversion/processing-bson-files, but I modified it below):

# Simple data processing
import io
import bson  # pymongo's bson module provides decode_file_iter
import numpy as np
import pandas as pd
from skimage.io import imread

data = bson.decode_file_iter(open('train.bson', 'rb'))

prod_to_category = dict()

i = 0
j = 1000

# Loop through the dataset
for c, d in enumerate(data):
    product_id = d['_id']
    category_id = d['category_id'] # This won't be in Test data
    prod_to_category[product_id] = category_id
    i += 1

    # Counter: print progress every 1000 records
    if i == j:
        print(i, "records loaded")
        print(picture_1.shape)
        j += 1000

    for e, pic in enumerate(d['imgs']):

        # Reshape the image array and add it as a new row of the DataFrame
        if c == 0:
            picture_1 = np.reshape(imread(io.BytesIO(pic['picture'])), (1, 180, 180, 3))
            get = pd.DataFrame({'A': [product_id], 'B': [category_id], 'C': [picture_1]})
            frames = get
            break  # only keep the first image of each product
        else:
            picture_2 = np.reshape(imread(io.BytesIO(pic['picture'])), (1, 180, 180, 3))
            get2 = pd.DataFrame({'A': [product_id], 'B': [category_id], 'C': [picture_2]})
            # Note: on pandas >= 2.0, DataFrame.append was removed;
            # use frames = pd.concat([frames, get2]) instead
            frames = frames.append(get2)
            break  # only keep the first image of each product

So the head of the pandas DataFrame, 'frames', looks like this. Note, in this example pretend that I stopped the loop exactly at 100,000 records:

[image: head of the 'frames' DataFrame, showing columns A, B, and C]

How can I convert this entire column 'C', where each cell holds an array of shape (1, 180, 180, 3), into a single NumPy array of shape (100000, 180, 180, 3) so that I can feed it into my neural network? Preferably I would like to avoid using a for loop to do this.

I have looked online and tried multiple things but could not find out how to do this. Once I figure this out, I should be able to re-train my network with a new array of 100,000 images, and do this over and over until I have fitted all seven million images to my model. I am really new to this kind of stuff, so any other help or suggestions would be much appreciated.

asked Dec 23 '22 by James


2 Answers

Edit: this answer is overkill, given that you were looking for a simple Pandas function, but I'll leave it here in case it helps someone else doing out-of-memory training with Keras.

You should definitely look into using HDF5. This is a compressed file format that allows you to store data in a hierarchical fashion, and load data selectively. Think of it like a zip file, with a folder structure. If you're working in Python, you can use h5py (link to h5py documentation, also a very dense and useful O'Reilly book on the topic if you have $$ or access to Safari Bookshelf, which most public libraries do).

Create HDF5 file with data manually

To use h5py, you'll create an HDF5 file and add data iteratively to it. You'll have to make one pass through your data to compress it (create an HDF5 structure, and iterate through each image to add it to the HDF5 file). You might want to divide it into batches of N images within the HDF5 file yourself, but that isn't strictly necessary (see below). You could do this on your local machine, or on a high-memory compute instance using the cloud provider of your choice.

For example, suppose you define a load_images() function that will grab a certain number of images, from start_index to end_index, and would return a nested np.array() of np.array() objects (I'll leave this to you to define, but it seems like you already have this, or at least have something very close). Then you would load the data into an HDF5 file like this:

import h5py

image_set_1 = load_images(path_to_bson, start_index, end_index)
with h5py.File(output_path, mode="w") as h5file:
    h5file.create_dataset("image_set_1", data=image_set_1)
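Later you can read back just one slice of that dataset without pulling the whole file into memory; a minimal sketch, reusing the dataset name from the snippet above:

import h5py

with h5py.File(output_path, mode="r") as h5file:
    # h5py reads only the requested slice from disk
    first_batch = h5file["image_set_1"][0:100000]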

Use fuel

I recommend the library fuel, which was designed to organize/compress/store large datasets for use in Keras, Theano, and Lasagne. It basically does the same thing as above, but with a lot more options. To use it, you would:

  • Define a fuel dataset (basically a stub class)
  • Define a fuel downloader (a way of obtaining the data - could be locally available, since you already have it)
  • Define a fuel converter (something that will iterate through the data and add it to an HDF5 file, similar to above code snippet)

(Documentation gives a toy example using H5PYDataset class, which is basically what you'd follow.)

Then you run the fuel-download <name-of-dataset> utility to download your dataset, then fuel-convert <name-of-dataset> to run the converter.

The end result is an HDF5 file that contains your data in an organized fashion, and you now have a number of ways of accessing and sampling the data. For example, you can create a DataStream that will create an arbitrary iterator, and pass it an iteration scheme where you can specify your own custom batch sizes, sample randomly or in order, or sample according to a particular batch "schedule". (See DataStreams in documentation.)

Example: say your data set has 100,000 images. A fuel converter would stuff all of those 100,000 images into an HDF5 file (using whatever scheme you've defined - perhaps you want to organize them according to tasks, or perhaps you want to leave them all flat. Up to you.) Once you run the converter, your data is a fuel data set. Then you might say, I want to train my neural network using images in shuffled order - then you'd use a ShuffledScheme. Then tomorrow you might say, I want to iterate through images in order - then you'd use a SequentialScheme. Then you might say, I want to specify the images to use for each batch - then you'd use a BatchScheme. That's the sort of flexibility fuel gives you.
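A minimal sketch of that workflow, assuming the converter has already written a 'train' split to a file called dataset.hdf5 (both names are placeholders):

from fuel.datasets.hdf5 import H5PYDataset
from fuel.schemes import ShuffledScheme
from fuel.streams import DataStream

train_set = H5PYDataset('dataset.hdf5', which_sets=('train',))
scheme = ShuffledScheme(examples=train_set.num_examples, batch_size=128)
stream = DataStream(train_set, iteration_scheme=scheme)

for batch in stream.get_epoch_iterator():
    # each batch is a tuple of numpy arrays you can pass to model.train_on_batch()
    pass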

Use Keras HDF5Matrix

The last option is to use the Keras built-in utilities for dealing with HDF5 files: HDF5Matrix. The workflow would look similar to the HDF5 workflow mentioned above (make a single pass through all of your data to compress it into an HDF5 file), except you can now selectively load portions of the data from Keras directly. This would be more conducive to a situation where you group the images by batch yourself in the HDF5 file, and your workflow would look like the following:

  • Load batch1 with keras.HDF5Matrix()
  • Train the model with batch1
  • Load batch2 with keras.HDF5Matrix()
  • Train the model with batch2
  • etc...

These are fairly straightforward to write yourself (there are several examples for various data sets available on GitHub).

Alternatively, you could load larger chunks (or all) of the data as a very large numpy array, and use the start and end arguments for the HDF5Matrix() call to limit the amount of data you're loading. That would also require reshaping your numpy data, though.
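A rough sketch of that batch-by-batch workflow, assuming an HDF5 file with 'images' and 'labels' datasets and an already-compiled model (file and dataset names are assumptions; HDF5Matrix lives in keras.utils in older standalone Keras versions):

from keras.utils import HDF5Matrix

# start/end select one 100,000-image batch from the assumed data.h5 file
x_batch1 = HDF5Matrix('data.h5', 'images', start=0, end=100000)
y_batch1 = HDF5Matrix('data.h5', 'labels', start=0, end=100000)
model.fit(x_batch1, y_batch1, batch_size=32, shuffle='batch')

x_batch2 = HDF5Matrix('data.h5', 'images', start=100000, end=200000)
y_batch2 = HDF5Matrix('data.h5', 'labels', start=100000, end=200000)
model.fit(x_batch2, y_batch2, batch_size=32, shuffle='batch')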

Final recommendation

My overall recommendation would be to use fuel. I've successfully used it for some very large data sets and out-of-memory training schemes.

answered May 22 '23 by charlesreid1


You can use .tolist():

# example data
import numpy as np
import pandas as pd

N = 20000
cdata = np.random.random(N).reshape(10, 20, 20, 5)
adata = [True] * len(cdata)
df = pd.DataFrame({"A": adata, "C": cdata.tolist()})

df.head()
      A                                                  C
0  True  [[[0.18399037775743088, 0.6762324340882544, 0....
1  True  [[[0.9030084241016858, 0.4060105756597291, 0.4...
2  True  [[[0.2659580640570838, 0.8247979431136298, 0.6...
3  True  [[[0.9626035946363627, 0.16487112072561239, 0....
4  True  [[[0.034946598341842106, 0.17646725825025167, ...

c = np.array(df.C.tolist())

c.shape 
# (10, 20, 20, 5)
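If the cells already hold (1, 180, 180, 3) arrays, as in your frames DataFrame, a small variation on the same idea is to concatenate the column directly:

c = np.concatenate(frames['C'].tolist(), axis=0)

c.shape
# (100000, 180, 180, 3)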
answered May 22 '23 by andrew_reece