 

Fast way to import and crop a jpeg in python lib

I have a Python app that imports 200k+ images, crops them, and passes each cropped image to pyzbar to decode a barcode. Cropping helps because each image contains multiple barcodes and, presumably, pyzbar is a little faster when given smaller images.

Currently I am using Pillow to import and crop the image.

On average, importing and cropping an image takes 262 ms and pyzbar takes 8 ms.

A typical run is about 21 hours.

I wonder whether a library other than Pillow might offer substantial improvements in loading and cropping. Ideally the library should be available for macOS, but I could also run the whole thing in an Ubuntu virtual machine.

I am working on a version that can run in parallel processes, which will be a big improvement, but if I could get a speed increase of 25% or more from a different library I would add that too.

WesR asked Mar 04 '19 00:03





2 Answers

As you didn't provide a sample image, I made a dummy file with dimensions 2544x4200 at 1.1MB in size and it is provided at the end of the answer. I made 1,000 copies of that image and processed all 1,000 images for each benchmark.

As you only gave your code in the comments area, I took it, formatted it and made the best I could of it. I also put it in a loop so it can process many files in a single invocation of the Python interpreter. This becomes important when you have 200,000+ files.

That looks like this:

#!/usr/bin/env python3

import sys
from PIL import Image

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
   print(f'Processing: {filename}')
   imgc = Image.open(filename).crop((0, 150, 270, 1050))

My suspicion is that I can make that faster using:

  • GNU Parallel, and/or
  • pyvips

Here is a pyvips version of your code:

#!/usr/bin/env python3

import sys
import pyvips
import numpy as np

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
   print(f'Processing: {filename}')

   img = pyvips.Image.new_from_file(filename, access='sequential')
   roi = img.crop(0, 150, 270, 900)
   mem_img = roi.write_to_memory()

   # Make a numpy array from that buffer object
   nparr = np.ndarray(buffer=mem_img, dtype=np.uint8,
                   shape=[roi.height, roi.width, roi.bands])
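The two scripts pass different-looking numbers to crop because the libraries use different conventions: Pillow takes a box of (left, upper, right, lower) edges, while pyvips takes left, top, width, height. A small sketch of the conversion, where the helper name pil_box_to_vips is made up for illustration and is not part of either library:

```python
def pil_box_to_vips(box):
    """Convert a Pillow (left, upper, right, lower) box to pyvips crop args."""
    left, upper, right, lower = box
    if right <= left or lower <= upper:
        raise ValueError(f"empty or inverted box: {box}")
    # pyvips wants the top-left corner plus a width and a height
    return left, upper, right - left, lower - upper


# The Pillow box from the first script maps onto the pyvips arguments
vips_args = pil_box_to_vips((0, 150, 270, 1050))
print(vips_args)  # (0, 150, 270, 900)
```

So both versions select the same 270x900 region, and the benchmark compares like with like.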

Here are the results:

Sequential original code

./orig.py bc*jpg
224 seconds, i.e. 224 ms per image, roughly in line with your 262 ms

Parallel original code

parallel ./orig.py ::: bc*jpg
55 seconds

Parallel original code but passing as many filenames as possible

parallel -X ./orig.py ::: bc*jpg
42 seconds   

Sequential pyvips

./vipsversion bc*
30 seconds, i.e. about 7x as fast as PIL, which took 224 seconds

Parallel pyvips

parallel ./vipsversion ::: bc*
32 seconds

Parallel pyvips but passing as many filenames as possible

parallel -X ./vipsversion ::: bc*
5.2 seconds, i.e. this is the way to go :-)
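To put those timings in perspective, here is a rough extrapolation to the 200,000 images mentioned in the question. Treat it as a sketch only: it assumes throughput scales linearly from the 1,000-file runs and that the real images behave like the dummy file, and it ignores the pyzbar decoding time. The names RESULTS and extrapolate are just for illustration.

```python
# Wall-clock seconds for 1,000 images, copied from the results above
RESULTS = {
    "sequential PIL": 224,
    "parallel PIL": 55,
    "parallel -X PIL": 42,
    "sequential pyvips": 30,
    "parallel pyvips": 32,
    "parallel -X pyvips": 5.2,
}
BENCH_IMAGES = 1_000
TARGET_IMAGES = 200_000  # "200k+" in the question


def extrapolate(seconds, n_bench=BENCH_IMAGES, n_target=TARGET_IMAGES):
    """Return (ms per image, projected minutes for n_target images)."""
    per_image_s = seconds / n_bench
    return per_image_s * 1000, per_image_s * n_target / 60


baseline = RESULTS["sequential PIL"]
for name, secs in RESULTS.items():
    ms, minutes = extrapolate(secs)
    speedup = baseline / secs
    print(f"{name:20s} {ms:6.1f} ms/image {minutes:7.1f} min {speedup:5.1f}x")
```

Under those assumptions the fastest variant works out to roughly 17 minutes of load-and-crop time for 200,000 images, against about 12.4 hours for the sequential Pillow loop.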

[Image: dummy test file]


Note that you can install GNU Parallel on macOS with homebrew:

brew install parallel
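If installing GNU Parallel is not an option, the batching that parallel -X does can be approximated from inside Python. The following is a minimal sketch, not something I benchmarked: it assumes one batch of filenames per CPU core and reuses the Pillow crop from the original script. The function names split_into_batches, crop_batch and main are made up for illustration, and in a real run the cropped image would be handed to pyzbar instead of being discarded.

```python
#!/usr/bin/env python3
import os
import sys
from concurrent.futures import ProcessPoolExecutor


def split_into_batches(items, n_batches):
    """Split items into at most n_batches contiguous, near-equal batches."""
    n_batches = max(1, min(n_batches, len(items)))
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return [b for b in batches if b]


def crop_batch(filenames):
    """Worker: load and crop every file in one batch, return the count."""
    from PIL import Image  # imported lazily, inside the worker process
    done = 0
    for filename in filenames:
        with Image.open(filename) as img:
            img.crop((0, 150, 270, 1050))  # same box as the original script
        done += 1
    return done


def main(filenames):
    """Crop all files using one batch per CPU core; return how many were done."""
    if not filenames:
        return 0
    workers = os.cpu_count() or 1
    batches = split_into_batches(filenames, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(crop_batch, batches))


if __name__ == "__main__":
    print(f"Cropped {main(sys.argv[1:])} images")
```

Saved under a hypothetical name such as batchcrop.py, it would be run as ./batchcrop.py bc*jpg. Handing each worker a whole batch, rather than one file per task, keeps the per-process startup cost from dominating.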
Mark Setchell answered Sep 28 '22 07:09



You might take a look at PyTurboJPEG, a Python wrapper for libjpeg-turbo that offers very fast rescaling (1/2, 1/4, 1/8) while decoding large JPEG images. The numpy.ndarray it returns is handy for cropping, and its JPEG encoding speed is also remarkable.

from turbojpeg import TurboJPEG

# specifying library path explicitly
# jpeg = TurboJPEG(r'D:\turbojpeg.dll')
# jpeg = TurboJPEG('/usr/lib64/libturbojpeg.so')
# jpeg = TurboJPEG('/usr/local/lib/libturbojpeg.dylib')

# using default library installation
jpeg = TurboJPEG()

# direct rescaling 1/2 while decoding input.jpg to BGR array
with open('input.jpg', 'rb') as in_file:
    bgr_array_half = jpeg.decode(in_file.read(), scaling_factor=(1, 2))

# encoding the half-size BGR array to output.jpg with default settings
with open('output.jpg', 'wb') as out_file:
    out_file.write(jpeg.encode(bgr_array_half))

Prebuilt libjpeg-turbo binaries for macOS and Linux are also available.
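Cropping the decoded array is then plain NumPy slicing. Below is a minimal sketch that uses a zero-filled stand-in array rather than a real decode, so it runs without libjpeg-turbo installed. It assumes the 2544x4200 test image from the other answer is 2544 wide and 4200 high, and the helper name crop_decoded is made up for illustration. Note that a crop box given in full-resolution pixels has to be scaled by the same factor used when decoding.

```python
import numpy as np


def crop_decoded(array, box, scaling_factor=(1, 1)):
    """Crop a decoded image array using a box in full-resolution pixels.

    box is (left, upper, right, lower); scaling_factor is the
    (numerator, denominator) pair that was passed to the decoder.
    """
    num, den = scaling_factor
    left, upper, right, lower = (v * num // den for v in box)
    # Arrays are indexed [row, column, channel], so y comes before x
    return array[upper:lower, left:right]


# Stand-in for a 2544x4200 JPEG decoded at 1/2 scale:
# 2100 rows, 1272 columns, 3 channels (BGR)
bgr_array_half = np.zeros((2100, 1272, 3), dtype=np.uint8)

roi = crop_decoded(bgr_array_half, (0, 150, 270, 1050), scaling_factor=(1, 2))
print(roi.shape)  # (450, 135, 3)
```

The slice is a view, so the crop itself copies no pixel data. Whether pyzbar still reads the barcodes reliably at half resolution is something to check on real images.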

Lilo Huang answered Sep 28 '22 07:09
