I have a Python app that imports 200k+ images, crops them, and presents the cropped image to pyzbar to interpret a barcode. Cropping helps because there are multiple barcodes on the image and, presumably, pyzbar is a little faster when given smaller images.
Currently I am using Pillow to import and crop the image.
On average, importing and cropping an image takes 262 ms and pyzbar takes 8 ms.
A typical run is about 21 hours.
I wonder if a library other than Pillow might offer substantial improvements in loading/cropping. Ideally the library should be available for macOS, but I could also run the whole thing in a virtual Ubuntu machine.
I am working on a version that can run in parallel processes, which will be a big improvement, but if I could get a 25% or greater speed increase from a different library I would also add that.
As you didn't provide a sample image, I made a dummy file with dimensions 2544x4200 and 1.1 MB in size; it is provided at the end of the answer. I made 1,000 copies of that image and processed all 1,000 images for each benchmark.
As you only gave your code in the comments area, I took it, formatted it and made the best I could of it. I also put it in a loop so it can process many files for just one invocation of the Python interpreter, which becomes important when you have 20,000 files.
That looks like this:
#!/usr/bin/env python3

import sys
from PIL import Image

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
    print(f'Processing: {filename}')
    imgc = Image.open(filename).crop((0, 150, 270, 1050))
My suspicion is that I can make that faster using GNU Parallel and/or pyvips.
Here is a pyvips version of your code:
#!/usr/bin/env python3

import sys
import pyvips
import numpy as np

# Process all input files so we only incur Python startup overhead once
for filename in sys.argv[1:]:
    print(f'Processing: {filename}')
    img = pyvips.Image.new_from_file(filename, access='sequential')
    roi = img.crop(0, 150, 270, 900)   # pyvips crop() takes (left, top, width, height)
    mem_img = roi.write_to_memory()
    # Make a numpy array from that buffer object
    nparr = np.ndarray(buffer=mem_img, dtype=np.uint8,
                       shape=[roi.height, roi.width, roi.bands])
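For completeness, the resulting numpy array can be handed straight to pyzbar. This is just a sketch of that step, assuming pyzbar is already installed; the same call also works on the PIL image from the first version:

from pyzbar.pyzbar import decode

# pyzbar's decode() accepts numpy arrays (and PIL Images) directly
for barcode in decode(nparr):
    print(barcode.type, barcode.data.decode('utf-8'))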
Here are the results:
./orig.py bc*jpg
    224 seconds, i.e. 224 ms per image, same as you

parallel ./orig.py ::: bc*jpg
    55 seconds

parallel -X ./orig.py ::: bc*jpg
    42 seconds

./vipsversion bc*
    30 seconds, i.e. 7x as fast as PIL which was 224 seconds

parallel ./vipsversion ::: bc*
    32 seconds

parallel -X ./vipsversion ::: bc*
    5.2 seconds, i.e. this is the way to go :-)
Note that you can install GNU Parallel on macOS with Homebrew:
brew install parallel
You might take a look at PyTurboJPEG, which is a Python wrapper for libjpeg-turbo with insanely fast rescaling (1/2, 1/4, 1/8) while decoding large JPEG images; the returned numpy.ndarray is handy for image cropping. Moreover, its JPEG encoding speed is also remarkable.
from turbojpeg import TurboJPEG
# specifying library path explicitly
# jpeg = TurboJPEG(r'D:\turbojpeg.dll')
# jpeg = TurboJPEG('/usr/lib64/libturbojpeg.so')
# jpeg = TurboJPEG('/usr/local/lib/libturbojpeg.dylib')
# using default library installation
jpeg = TurboJPEG()
# direct rescaling 1/2 while decoding input.jpg to BGR array
in_file = open('input.jpg', 'rb')
bgr_array_half = jpeg.decode(in_file.read(), scaling_factor=(1, 2))
in_file.close()
# encoding the downscaled BGR array to output.jpg with default settings
out_file = open('output.jpg', 'wb')
out_file.write(jpeg.encode(bgr_array_half))
out_file.close()
libjpeg-turbo prebuilt binaries for macOS and Linux are also available here.
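If you go this route, the decoded numpy.ndarray can be cropped with plain numpy slicing and passed to pyzbar. Below is a rough sketch, not tested against your data: the file name is a placeholder, the crop box is borrowed from the Pillow example above, and I have left out the 1/2 rescaling since that would also require scaling the crop coordinates:

from turbojpeg import TurboJPEG
from pyzbar.pyzbar import decode

jpeg = TurboJPEG()

# Placeholder file name - substitute your own images
with open('input.jpg', 'rb') as f:
    bgr = jpeg.decode(f.read())   # full-resolution BGR ndarray

# Crop with numpy slicing: [upper:lower, left:right],
# reusing the (0, 150, 270, 1050) box from the Pillow example
roi = bgr[150:1050, 0:270]

# The cropped array can be passed straight to pyzbar
barcodes = decode(roi)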