
How to reduce wand memory usage?

I am using wand and pytesseract to get the text of PDFs uploaded to a Django website, like so:

import io

import pytesseract
from PIL import Image as PI
from wand.image import Image

# read_pdf_file holds the bytes of the uploaded PDF
image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png = image_pdf.convert('png')

req_image = []
final_text = []

# keep a PNG blob for every page
for img in image_png.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('png'))

# OCR each page blob with tesseract
for img in req_image:
    txt = pytesseract.image_to_string(PI.open(io.BytesIO(img)).convert('RGB'))
    final_text.append(txt)

return " ".join(final_text)

I have it running in Celery on a separate EC2 server. However, because image_pdf grows to approximately 4 GB for even a 13.7 MB PDF file, the task is being stopped by the OOM killer. Instead of paying for more RAM, I want to reduce the memory used by wand and ImageMagick. Since the task is already async, I don't mind increased computation time. I have skimmed this: http://www.imagemagick.org/Usage/files/#massive, but am not sure whether it can be implemented through wand. Another possible fix would be a way to open a PDF in wand one page at a time rather than loading the full image into RAM at once. Alternatively, how could I interface with ImageMagick directly from Python so that I could use these memory-limiting techniques?
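For example, would something like the following even take effect through wand? (The limit values below are only guesses.)

import os

# These must be set before ImageMagick is first loaded by wand;
# the values here are guesses, not recommendations.
os.environ['MAGICK_MEMORY_LIMIT'] = '512MiB'  # pixel cache heap ceiling
os.environ['MAGICK_MAP_LIMIT'] = '1GiB'       # memory-mapped cache ceiling
os.environ['MAGICK_DISK_LIMIT'] = '8GiB'      # how much may spill to disk

from wand.image import Image  # imported only after the limits are set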

asked May 26 '17 by Justin Buhl

2 Answers

Remember that the wand library integrates with the MagickWand API, which in turn delegates the PDF encoding/decoding work to ghostscript. Both MagickWand & ghostscript allocate additional memory resources, and do their best to deallocate them at the end of each task. However, if routines are initialized by python and held by a variable, it's more than possible to introduce memory leaks.

Here are some tips to ensure memory is managed correctly.

  1. Use with context management for all Wand assignments. This will ensure all resources pass through __enter__ & __exit__ management handlers.

  2. Avoid blob creation for passing data. When creating a file-format blob, MagickWand will allocate additional memory to copy & encode the image, and python will hold the resulting data on top of the originating wand instance. That's usually fine in a dev environment, but it can grow out of hand quickly in a production setting (a temp-file alternative is sketched after this list).

  3. Avoid Image.sequence. This is another copy-heavy routine, and results in python holding a bunch of memory resources. Remember ImageMagick manages the image stacks very well, so if you're not reordering / manipulating individual frames, it's best to use MagickWand methods & not involve python.

  4. Each task should be an isolated process that can cleanly shut down on completion. This shouldn't be an issue for you with celery as a queue worker, but it's worth double-checking the thread/worker configuration + docs (a worker-recycling sketch follows this list).

  5. Watch out for resolution. A PDF rasterized at 300 DPI on a Q16 (16-bit quantum depth) build of ImageMagick results in a massive raster image. With many OCR (tesseract/opencv) techniques, the first step is to pre-process the inbound data to remove extra/unneeded colors / channels / data / etc.
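For example, tips 1, 2 & 5 can be combined by letting ImageMagick write every page straight to a temporary directory and feeding those files to tesseract one at a time. This is only a rough sketch, not battle-tested code; the helper name, the 150 DPI, the 8-bit depth and the grayscale conversion are all assumptions you'd tune for your own documents:

import glob
import os
import tempfile

import pytesseract
from PIL import Image as PILImage
from wand.image import Image


def pdf_to_text(pdf_path):
    text_parts = []
    with tempfile.TemporaryDirectory() as workdir:
        # Tip 1: the `with` block releases the MagickWand resources on exit.
        with Image(filename=pdf_path, resolution=150) as pdf:
            pdf.depth = 8  # Tip 5: 8-bit is plenty for OCR
            # Tip 2: let ImageMagick write the pages straight to disk instead
            # of round-tripping blobs through python. For formats that can't
            # hold multiple frames it appends the page number for you
            # (page-0.png, page-1.png, ...).
            pdf.save(filename=os.path.join(workdir, 'page.png'))

        def page_number(path):
            digits = ''.join(ch for ch in os.path.basename(path) if ch.isdigit())
            return int(digits) if digits else 0

        for page_file in sorted(glob.glob(os.path.join(workdir, 'page*.png')),
                                key=page_number):
            with PILImage.open(page_file) as page:
                # Tip 5: grayscale is usually enough for tesseract.
                text_parts.append(pytesseract.image_to_string(page.convert('L')))
    return " ".join(text_parts)

Writing to disk trades RAM for disk I/O, which is an easy trade when the task already runs asynchronously.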
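And for tip 4, celery can recycle its worker processes so anything a task leaked is handed back to the OS between tasks. A minimal config sketch (the numbers are arbitrary):

# celeryconfig.py (Celery 4+ setting names)
# Replace each worker process after it has run this many tasks, so memory
# that MagickWand/ghostscript failed to release goes back to the OS.
worker_max_tasks_per_child = 25

# Optionally also recycle a worker once it exceeds this resident size (KiB).
worker_max_memory_per_child = 512000

The same can be done on the command line with the worker's --max-tasks-per-child option.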

Here's an example of how I would approach this. Note, I'll leverage ctypes to directly manage the image stack w/o additional python resources.

import ctypes

import pytesseract
from PIL import Image as PI
from wand.image import Image
from wand.api import library

# Tell wand about C-API method
library.MagickNextImage.argtypes = [ctypes.c_void_p]
library.MagickNextImage.restype = ctypes.c_int

# ... Skip to calling method ...

final_text = []
with Image(blob=read_pdf_file, resolution=100) as context:
    context.depth = 8
    library.MagickResetIterator(context.wand)
    while library.MagickNextImage(context.wand) != 0:
        # pytesseract needs a PIL image, so wrap the raw 8-bit RGB pixels of
        # the current page (3 bytes per pixel) with PIL before the OCR call.
        data = context.make_blob("RGB")
        size = context.width, context.height
        page = PI.frombytes("RGB", size, data[:size[0] * size[1] * 3])
        text = pytesseract.image_to_string(page)
        final_text.append(text)
return " ".join(final_text)

Of course your mileage may vary. If you're comfortable with subprocess, you may be able to execute gs & tesseract directly, and eliminate all the python wrappers; a rough sketch of that follows.
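For instance, a page-at-a-time pipeline over subprocess could look roughly like this. The pnggray device, 150 DPI and temp-directory handling are assumptions to tune, and real code would want error handling around the external calls:

import glob
import os
import subprocess
import tempfile


def pdf_to_text_external(pdf_path):
    text_parts = []
    with tempfile.TemporaryDirectory() as workdir:
        png_pattern = os.path.join(workdir, 'page-%04d.png')
        # ghostscript rasterises one page at a time, so peak memory stays
        # around a single page no matter how large the PDF is.
        subprocess.check_call([
            'gs', '-q', '-dNOPAUSE', '-dBATCH', '-dSAFER',
            '-sDEVICE=pnggray', '-r150',
            '-sOutputFile=' + png_pattern, pdf_path,
        ])
        for page_file in sorted(glob.glob(os.path.join(workdir, 'page-*.png'))):
            # `tesseract <image> stdout` prints the recognised text.
            text = subprocess.check_output(['tesseract', page_file, 'stdout'])
            text_parts.append(text.decode('utf-8'))
    return " ".join(text_parts)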

answered Oct 18 '22 by emcconville


I was also suffering from memory leak issues. After some research and tweaking the code implementation, my issues were resolved. Basically I got it working correctly by using with and the destroy() function.

In some cases I could use with to open and read the files, as in the example below:

with Image(filename=pdf_file, resolution=300) as pdf:
    ...  # work with the pages of the PDF here

In this case, using with, the memory and temp files are managed correctly.

And in another case I had to use the destroy() function, preferably inside a try / finally block, as below:

try:
    for img in pdfImg.sequence:
        ...  # your code
finally:
    pdfImg.destroy()

The second case is an example where I couldn't use with, because I had to iterate through the pages via sequence, so the file was already open and I was iterating over its pages.

This combination of solutions resolved my problems with memory leaks.
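Put together, the two patterns look roughly like this (reusing pdf_file from above; the pytesseract/PIL part is only for illustration):

import io

import pytesseract
from PIL import Image as PILImage
from wand.image import Image

final_text = []
# `with` takes care of the PDF itself ...
with Image(filename=pdf_file, resolution=300) as pdf_img:
    for page in pdf_img.sequence:
        # ... and destroy() in a finally block takes care of each page copy.
        img_page = Image(image=page)
        try:
            blob = img_page.make_blob('png')
            final_text.append(
                pytesseract.image_to_string(PILImage.open(io.BytesIO(blob))))
        finally:
            img_page.destroy()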

answered Oct 18 '22 by Juliano Pacheco