I have been trying to process two huge files containing around 40000-50000 images in python. But whenever I try to convert my datasets into a numpy array I get a Memory error. I only have about 8GB RAM which isn't very much, but, because I lack experience in python, I wonder if there is any way that I can resolve this issue by using some python library I don't know about, or maybe by optimizing my code? I would like to hear your opinion on this matter.
My image processing code:
from sklearn.cluster import MiniBatchKMeans
import numpy as np
import glob
import os
from PIL import Image
from sklearn.decomposition import PCA
image_dir1 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/train"
image_dir2 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/test1"
Standard_size = (300,200)
pca = PCA(n_components = 10)
file_open = lambda x,y: glob.glob(os.path.join(x,y))
def matrix_image(image):
    "opens image and converts it to a m*n matrix" 
    image = Image.open(image)
    print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
    image = image.resize(Standard_size)
    image = list(image.getdata())
    image = map(list,image)
    image = np.array(image)
    return image
def flatten_image(image):  
    """
    takes in a n*m numpy array and flattens it to 
    an array of the size (1,m*n)
    """
    s = image.shape[0] * image.shape[1]
    image_wide = image.reshape(1,s)
    return image_wide[0]
if __name__ == "__main__":
    train_images = file_open(image_dir1,"*.jpg")
    test_images = file_open(image_dir2,"*.jpg")
    train_set = []
    test_set = []
    "Loop over all images in files and modify them"
    train_set = [flatten_image(matrix_image(image))for image in train_images]
    test_set = [flatten_image(matrix_image(image))for image in test_images]
    train_set = np.array(train_set) #This is where the Memory Error occurs
    test_set = np.array(test_set)
Small edit: I'm using 64-bit python
Assuming a 4 byte integer for each pixel, you are trying to hold about 11.2 GB of data in (4*300*200*50000 / (1024)**3). Half that for a 2 byte integer.
You have a few options:
Instead of copying from list to numpy, which will temporarily use twice the amount of memory, as you do here:
test_set = [flatten_image(matrix_image(image))for image in test_images]
test_set = np.array(test_set)
Do this:
n = len(test_images)
test_set = numpy.zeros((n,300*200),dtype=int)
for i in range(n):
    test_set[i] = flatten_image(matrix_image(test_images[i]))
                        If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With