 

Tips on processing a lot of images in Python

I have been trying to process two huge files containing around 40,000-50,000 images in Python, but whenever I try to convert my datasets into a numpy array I get a MemoryError. I only have about 8 GB of RAM, which isn't very much, but because I lack experience in Python I wonder whether there is some library I don't know about, or some way of optimizing my code, that would resolve this issue. I would like to hear your opinion on this matter.

My image processing code:

from sklearn.cluster import MiniBatchKMeans
import numpy as np
import glob
import os
from PIL import Image
from sklearn.decomposition import PCA

image_dir1 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/train"
image_dir2 = "C:/Users/Ai/Desktop/KAGA FOLDER/C/test1"
Standard_size = (300,200)
pca = PCA(n_components = 10)
file_open = lambda x,y: glob.glob(os.path.join(x,y))


def matrix_image(image):
    """Opens an image and converts it to an m*n matrix."""
    image = Image.open(image)
    print("changing size from %s to %s" % (str(image.size), str(Standard_size)))
    image = image.resize(Standard_size)
    image = list(image.getdata())
    image = map(list, image)
    image = np.array(image)
    return image

def flatten_image(image):  
    """
    takes in a n*m numpy array and flattens it to 
    an array of the size (1,m*n)
    """
    s = image.shape[0] * image.shape[1]
    image_wide = image.reshape(1,s)
    return image_wide[0]

if __name__ == "__main__":
    train_images = file_open(image_dir1,"*.jpg")
    test_images = file_open(image_dir2,"*.jpg")
    train_set = []
    test_set = []

    "Loop over all images in files and modify them"
    train_set = [flatten_image(matrix_image(image))for image in train_images]
    test_set = [flatten_image(matrix_image(image))for image in test_images]
    train_set = np.array(train_set) #This is where the Memory Error occurs
    test_set = np.array(test_set)

Small edit: I'm using 64-bit Python.

Asked Oct 18 '13 by Learner


1 Answer

Assuming a 4-byte integer for each pixel, you are trying to hold about 11.2 GB of data in memory (4 * 300 * 200 * 50000 / 1024**3). Half that for a 2-byte integer.
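A quick back-of-the-envelope check of that figure, using the sizes from your question:

bytes_needed = 4 * 300 * 200 * 50000        # bytes per pixel * pixels per image * images
print(bytes_needed / (1024.0 ** 3))         # about 11.2 GiB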

You have a few options:

  1. Reduce the number or size of images you are trying to hold in memory
  2. Use a file or database to hold the data instead of memory (may be too slow for some applications); see the memmap sketch after this list
  3. Use the memory you have more effectively...
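If you go the file-backed route (option 2), one way to sketch it is numpy's memmap: the array lives in a file on disk and only the rows you touch get paged into RAM. This is just a sketch; the file name images.dat is a placeholder, while the 300*200 row length and the helper functions are taken from your code:

import numpy as np

# Back the array with a file on disk instead of RAM; mode="w+" creates
# (or overwrites) the file. Shape and dtype must be fixed up front.
n = len(test_images)
test_set = np.memmap("images.dat", dtype=np.int32, mode="w+", shape=(n, 300*200))

for i in range(n):
    test_set[i] = flatten_image(matrix_image(test_images[i]))

test_set.flush()  # push any buffered rows out to the file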

For the third option: instead of copying from a list into numpy, which temporarily uses twice the memory, as you do here:

test_set = [flatten_image(matrix_image(image)) for image in test_images]
test_set = np.array(test_set)

Do this:

n = len(test_images)
test_set = np.zeros((n, 300*200), dtype=int)  # preallocate the full array once
for i in range(n):
    test_set[i] = flatten_image(matrix_image(test_images[i]))
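As a follow-up to the byte-size point above: if the pixel values are standard 8-bit data in the 0-255 range (which is what PIL gives you for a JPEG), you can shrink each value from the 4 or 8 bytes of dtype=int down to a single byte. A variant of the same loop, assuming 8-bit pixels:

# Same preallocation, but one byte per value; fine as long as the
# pixel values stay in the 0-255 range.
n = len(test_images)
test_set = np.zeros((n, 300*200), dtype=np.uint8)
for i in range(n):
    test_set[i] = flatten_image(matrix_image(test_images[i]))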
Answered Sep 28 '22 by dlm