This might be a silly question, but...
I have several thousand images that I would like to load into Python and then convert into numpy arrays. Obviously this goes a little slowly. But, I am actually only interested in a small portion of each image. (The same portion, just 100x100 pixels in the center of the image.)
Is there any way to load just part of the image to make things go faster?
Here is some sample code where I generate some sample images, save them, and load them back in.
import time
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Generate sample images
num_images = 5
for i in range(num_images):
    Z = np.random.rand(2000, 2000)
    print('saving %i' % i)
    plt.imsave('%03i.png' % i, Z)

# Load the images
for i in range(num_images):
    t = time.time()
    im = Image.open('%03i.png' % i)
    w, h = im.size
    # Crop the 100x100 region around the image center
    imc = im.crop((w//2 - 50, h//2 - 50, w//2 + 50, h//2 + 50))
    print('Time to open: %.4f seconds' % (time.time() - t))
    # Convert to a numpy array
    data = np.array(imc)
Using PIL and im.crop(box) usually works; see pythonware.com/library/pil/handbook/introduction.htm. Can you post some more code that showcases what you are doing?
To load an image, import the Image module from Pillow and call Image.open(), passing the image filename. The import is written against the PIL package name (from PIL import Image) rather than a pillow package, because Pillow keeps that name for backward compatibility with the older Python Imaging Library (PIL).
Saving an image is just as simple: call save() and pass the filename you want to use. The image is saved in the format identified by the extension of that filename.
While you can't get much faster than PIL crop in a single thread, you can use multiple cores to speed up everything! :)
I ran the code below on my 8-core i7 machine as well as my 7-year-old, two-core, barely 2 GHz laptop. Both saw significant improvements in run time. Much as you would expect, the improvement depended on the number of cores available.
The core of your code is the same; I just separated the looping from the actual computation so that the function could be applied to a list of values in parallel.
So, this:
for i in range(num_images):
    t = time.time()
    im = Image.open('%03i.png' % i)
    w, h = im.size
    # Crop the 100x100 region around the image center
    imc = im.crop((w//2 - 50, h//2 - 50, w//2 + 50, h//2 + 50))
    print('Time to open: %.4f seconds' % (time.time() - t))
    # Convert to a numpy array
    data = np.array(imc)
Became:
def convert(filename):
    im = Image.open(filename)
    w, h = im.size
    # Crop the 100x100 region around the image center
    imc = im.crop((w//2 - 50, h//2 - 50, w//2 + 50, h//2 + 50))
    return numpy.array(imc)
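Since the goal is a 100x100 region in the center of each image, the crop box has to be measured from the image midpoint rather than from its full width and height. Here is a minimal sketch of that arithmetic in plain Python. The helper name center_box and its size parameter are made up for illustration; the tuple it returns follows the (left, upper, right, lower) order that im.crop() expects.

```python
def center_box(width, height, size=100):
    """Return (left, upper, right, lower) for a size x size box
    centered in a width x height image, clamped to the image bounds."""
    half = size // 2
    cx, cy = width // 2, height // 2
    left = max(cx - half, 0)
    upper = max(cy - half, 0)
    right = min(left + size, width)
    lower = min(upper + size, height)
    return (left, upper, right, lower)


if __name__ == '__main__':
    # For the 2000x2000 sample images from the question
    print(center_box(2000, 2000))  # (950, 950, 1050, 1050)
```

Inside convert() this could be used as im.crop(center_box(w, h)). The clamping only matters for images smaller than the requested box.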
The key to the speedup is the Pool feature of the multiprocessing library. It makes it trivial to run things across multiple processors.
import os
import time
import numpy
from PIL import Image
from multiprocessing import Pool

# Path to where my test images are stored
img_folder = os.path.join(os.getcwd(), 'test_images')

# Collects all of the filenames for the images
# I want to process
images = [os.path.join(img_folder, f)
          for f in os.listdir(img_folder)
          if '.jpeg' in f]

# Your code, but wrapped up in a function
def convert(filename):
    im = Image.open(filename)
    w, h = im.size
    # Crop the 100x100 region around the image center
    imc = im.crop((w//2 - 50, h//2 - 50, w//2 + 50, h//2 + 50))
    return numpy.array(imc)

def main():
    # This is the hero of the code. It creates a pool of
    # worker processes across which you can "map" a function
    pool = Pool()

    t = time.time()
    # We run it normally (single core) first.
    # list() forces the work to happen here so the timing is meaningful
    np_arrays = list(map(convert, images))
    print('Time to open %i images in single thread: %.4f seconds' % (len(images), time.time() - t))

    t = time.time()
    # Now we run the same thing, but this time leveraging the worker pool
    np_arrays = pool.map(convert, images)
    print('Time to open %i images with multiple threads: %.4f seconds' % (len(images), time.time() - t))

if __name__ == '__main__':
    main()
Pretty basic. Only a few extra lines of code, and a little refactoring to move the conversion bit into its own function. The results speak for themselves:
8-core i7:
Time to open 858 images in single thread: 6.0040 seconds
Time to open 858 images with multiple threads: 1.4800 seconds
2-core laptop:
Time to open 858 images in single thread: 8.7640 seconds
Time to open 858 images with multiple threads: 4.6440 seconds
So there ya go! Even if you have a super old 2 core machine you can halve the time you spend opening and processing your images.
One caveat: memory. If you're processing thousands of images, you may run out of memory at some point. To get around this, you'll just have to process the data in chunks. You can still leverage all of the multiprocessing goodness, just in smaller bites. Something like:
for i in range(0, len(images), chunk_size):
    results = pool.map(convert, images[i : i+chunk_size])
    # rest of code.
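To make the chunking idea a bit more concrete, here is a rough sketch of a generator that walks a list in fixed-size slices and hands each slice to a mapping function. The names process_in_chunks and mapper are made up for this example. The built-in map is the default so the sketch runs on its own, and the assumption is that you would pass pool.map as mapper to get the parallel version.

```python
def process_in_chunks(func, items, chunk_size, mapper=map):
    """Apply func to items one chunk at a time, yielding each chunk's results.

    mapper is any callable with the signature of the built-in map,
    for example the map method of a multiprocessing Pool.
    """
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        # list() forces evaluation when mapper is the lazy built-in map
        yield list(mapper(func, chunk))


if __name__ == '__main__':
    # Stand-in workload; with real data this would be convert and the image list
    for results in process_in_chunks(abs, [-3, -2, -1, 0, 1], chunk_size=2):
        print(results)
```

Because it is a generator, only one chunk of results is held at a time, provided the caller reduces or writes out each chunk before asking for the next one.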