I am using wand and pytesseract to get the text of PDFs uploaded to a Django website like so:
import io

import pytesseract
from PIL import Image as PI
from wand.image import Image

image_pdf = Image(blob=read_pdf_file, resolution=300)
image_png = image_pdf.convert('png')

req_image = []
final_text = []

for img in image_png.sequence:
    img_page = Image(image=img)
    req_image.append(img_page.make_blob('png'))

for img in req_image:
    txt = pytesseract.image_to_string(PI.open(io.BytesIO(img)).convert('RGB'))
    final_text.append(txt)

return " ".join(final_text)
I have it running in Celery on a separate EC2 server. However, because image_pdf grows to approximately 4 GB for even a 13.7 MB PDF file, the task is being stopped by the OOM killer. Instead of paying for more RAM, I want to try to reduce the memory used by wand and ImageMagick. Since it is already async, I don't mind increased computation times. I have skimmed this: http://www.imagemagick.org/Usage/files/#massive, but am not sure whether it can be implemented with wand. Another possible fix would be a way to open a PDF in wand one page at a time rather than putting the full image into RAM at once. Alternatively, how could I interface with ImageMagick directly using Python so that I could use these memory-limiting techniques?
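For example, would something like the following work? This is only a rough sketch; I'm assuming wand's resource-limit mapping exposes ImageMagick's memory/map/disk limits, and the values are illustrative.

from wand.resource import limits

# Assumed API: cap ImageMagick's heap so pixel data spills to a
# memory-mapped file, then to disk, instead of exhausting RAM.
limits['memory'] = 512 * 1024 * 1024      # bytes of heap for pixel data
limits['map'] = 1024 * 1024 * 1024        # bytes of memory-mapped cache
limits['disk'] = 8 * 1024 * 1024 * 1024   # bytes of disk cache before erroring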
Remember that the wand library integrates with the MagickWand API, and in turn delegates PDF encoding/decoding work to ghostscript. Both MagickWand & ghostscript allocate additional memory resources, and do their best to deallocate at the end of each task. However, if routines are initialized by python and held by a variable, it's more than possible to introduce memory leaks.
Here are some tips to ensure memory is managed correctly.
Use with context management for all Wand assignments. This will ensure all resources pass through the __enter__ & __exit__ management handlers.
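A minimal illustration of the pattern (the filename is a placeholder):

from wand.image import Image

with Image(filename="upload.pdf", resolution=100) as img:
    pass  # work with img here; __exit__ releases the C-side pixel cache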
Avoid blob creation for passing data. When creating a file-format blob, MagickWand will allocate additional memory to copy & encode the image, and python will hold the resulting data in addition to the originating wand instance. Usually fine in a dev environment, but it can grow out of hand quickly in a production setting.
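One way around this (a sketch; the paths are illustrative, and the [0] suffix is ImageMagick's page-selection syntax) is to hand pages to tesseract through the filesystem instead of a blob:

import pytesseract
from wand.image import Image

# Encode straight to disk instead of make_blob(); python never holds
# the encoded bytes, and MagickWand frees its copy on __exit__.
with Image(filename="doc.pdf[0]", resolution=100) as img:  # first page only
    img.save(filename="/tmp/ocr-page.png")

# pytesseract accepts a file path, so no in-memory copy is needed here
text = pytesseract.image_to_string("/tmp/ocr-page.png")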
Avoid Image.sequence. This is another copy-heavy routine that results in python holding a bunch of memory resources. Remember that ImageMagick manages the image stacks very well, so if you're not reordering / manipulating individual frames, it's best to use MagickWand methods & not involve python (see the ctypes example below).
Each task should be an isolated process, and should cleanly shut down on completion. This shouldn't be an issue for you w/ celery as a queue worker, but it's worth double-checking the thread/worker configuration + docs.
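For instance, one way to enforce per-task isolation (setting names from Celery 4.x; the values are illustrative):

# celeryconfig.py (illustrative module)
worker_max_tasks_per_child = 1  # recycle the worker process after every task,
                                # returning any leaked C-level memory to the OS
worker_concurrency = 2          # keep concurrent rasterizations low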
Watch out for resolution. A PDF resolution of 300 @ 16Q would result in a massive raster image. With many OCR (tesseract/opencv) techniques, the first step is to pre-process the inbound data to remove extra/unneeded colors / channels / data / etc.
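A quick sketch of that pre-processing in wand (filename and resolution are illustrative):

from wand.image import Image

with Image(filename="doc.pdf", resolution=150) as img:
    img.transform_colorspace("gray")  # tesseract only needs luminance
    img.depth = 8                     # 8-bit channels instead of 16Q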
Here's an example of how I would approach this. Note, I'll leverage ctypes to directly manage the image stack w/o additional python resources.
import ctypes
import io

import pytesseract
from PIL import Image as PI
from wand.image import Image
from wand.api import library

# Tell wand about the C-API method
library.MagickNextImage.argtypes = [ctypes.c_void_p]
library.MagickNextImage.restype = ctypes.c_int

# ... Skip to calling method ...

final_text = []
with Image(blob=read_pdf_file, resolution=100) as context:
    context.depth = 8
    # Walk the image stack on the C side, one page at a time,
    # without materializing python copies of every frame.
    library.MagickResetIterator(context.wand)
    while library.MagickNextImage(context.wand) != 0:
        data = context.make_blob("png")
        text = pytesseract.image_to_string(PI.open(io.BytesIO(data)))
        final_text.append(text)
return " ".join(final_text)
Of course your mileage may vary. If you're comfortable with subprocess, you may be able to execute gs & tesseract directly, and eliminate all the python wrappers.
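A rough sketch of that route (assumes stock gs & tesseract binaries on PATH; paths and resolution are illustrative):

import glob
import os
import subprocess
import tempfile

def ocr_pdf(pdf_path):
    with tempfile.TemporaryDirectory() as tmp:
        # ghostscript rasterizes page-by-page, so peak memory stays
        # near a single page rather than the whole document
        subprocess.run([
            "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pnggray", "-r150",
            "-sOutputFile=" + os.path.join(tmp, "page-%04d.png"),
            pdf_path,
        ], check=True)
        texts = []
        for png in sorted(glob.glob(os.path.join(tmp, "page-*.png"))):
            # "stdout" as the output base tells tesseract to print the text
            result = subprocess.run(
                ["tesseract", png, "stdout"],
                check=True, capture_output=True, text=True,
            )
            texts.append(result.stdout)
        return " ".join(texts)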
I was also suffering from memory leak issues. After some research and tweaking the code implementation, my issues were resolved. Basically, I got it working correctly using with and the destroy() function.
In some cases I could use with to open and read the files, as in the example below:

with Image(filename=pdf_file, resolution=300) as pdf:

In this case, using with, the memory and tmp files are correctly managed.
And in another case I had to use the destroy() function, preferably inside a try / finally block, as below:
try:
    for img in pdfImg.sequence:
        # your code
finally:
    pdfImg.destroy()
The second case is an example where I couldn't use with, because I had to iterate over the pages through sequence, so the file was already open and I was iterating over its pages. This combination of solutions resolved my problems with memory leaks.