 

How to solve MemoryError using Python 3.7 pdf2image library?

I'm running a simple PDF-to-image conversion using the Python pdf2image library. I can understand that the library is crossing some maximum memory threshold to arrive at this error, but the PDF is only about 6.6 MB, so why would converting it take up GBs of memory?

Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdf2image import convert_from_path
>>> pages = convert_from_path(r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf', 200)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 1215, in _readerthread
    buffer.append(fh.read())
MemoryError

Also, what is a possible solution to this?

Update: When I reduce the dpi parameter passed to convert_from_path, it works like a charm, but the resulting images are low quality (for obvious reasons). Is there a way to fix this, such as creating the images batch by batch and clearing memory every time? If so, how do I go about it?

Asked by Aakash Basu on Jun 06 '19


3 Answers

Convert the PDF in blocks of 10 pages at a time (1-10, 11-20, and so on):

from pdf2image import pdfinfo_from_path, convert_from_path

# pdf_file is the path to your PDF
info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)
maxPages = info["Pages"]

for page in range(1, maxPages + 1, 10):
    images = convert_from_path(pdf_file, dpi=200, first_page=page,
                               last_page=min(page + 10 - 1, maxPages))
    # only this 10-page block is held in memory at a time
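If you want to keep the converted pages, each block can be written straight to disk before the next one is converted. A minimal sketch of that loop (the JPEG output and the file-naming scheme are my own assumptions, not part of the original answer):

from pdf2image import pdfinfo_from_path, convert_from_path

maxPages = pdfinfo_from_path(pdf_file)["Pages"]
for page in range(1, maxPages + 1, 10):
    images = convert_from_path(pdf_file, dpi=200, first_page=page,
                               last_page=min(page + 9, maxPages))
    for i, image in enumerate(images):
        # convert_from_path returns PIL Image objects, so .save() writes them out
        # (hypothetical file name scheme, adjust to taste)
        image.save(f'page_{page + i:03d}.jpg', 'JPEG')
    # `images` is rebound on the next iteration, so the previous block can be freed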
Answered by napuzba


I am a bit late to this, but the problem is indeed related to the 136 pages going into memory. You can do three things.

  1. Specify a format for the converted images.

By default, pdf2image uses PPM as its image format; it is faster, but it also takes a lot more memory (over 30 MB per image!). What you can do to fix this is use a more memory-friendly format like JPEG or PNG.

convert_from_path(r'C:\path\to\your\pdf', fmt='jpeg')

That will probably solve the problem, but only because of the compression; at some point (say for a 500+ page PDF) the problem will reappear.
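To put rough numbers on why this happens (the ~30 MB PPM figure comes from above and the 136-page count from the question; the per-page JPEG size is my own assumption):

# back-of-the-envelope estimate, not a measurement
pages = 136
ppm_mb_per_page = 30    # uncompressed PPM, as noted above
jpeg_mb_per_page = 2    # assumed typical JPEG size at 200 dpi
print(pages * ppm_mb_per_page)   # ~4080 MB held in memory at once
print(pages * jpeg_mb_per_page)  # ~272 MB as JPEG; smaller, but still linear in page count

Roughly 4 GB of uncompressed image data is far more than the 32-bit Python process shown in the question's traceback can address, which is why a 6.6 MB PDF can end in a MemoryError.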

  2. Use an output directory

This is the one I would recommend because it allows you to process any PDF. The example on the README page explains it well:

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)

This writes the images to temporary storage on your computer so you don't have to delete them manually. Just make sure to do any processing you need before exiting the with block, because the temporary files are removed at that point.
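For example (saving to JPEG here is just an illustration; any processing works, as long as it runs inside the with block while the temporary files still exist):

import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)
    for i, image in enumerate(images_from_path):
        # do the real work here; the file name is just an example
        image.save(f'page_{i:03d}.jpg', 'JPEG')
# on exit the temporary directory and its intermediate files are removed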

  3. Process the PDF file in chunks

pdf2image allows you to define the first and last page that you want to process. That means that in your case, with a PDF of 136 pages, you could do:

for i in range(0, 136 // 10 + 1):
    # pages are 1-indexed; shift the boundaries so no page is converted twice
    convert_from_path(r'C:\path\to\your\pdf',
                      first_page=i * 10 + 1, last_page=min((i + 1) * 10, 136))
Answered by Belval


The accepted answer has a small issue.

maxPages = pdf2image._page_count(pdf_file)

can no longer be used, as _page_count is deprecated. Here is a working alternative that gets the page count with PyPDF2 instead:

import pdf2image
from PyPDF2 import PdfFileReader

# get the page count with PyPDF2, then convert 100 pages per batch
with open(pdf, "rb") as f:
    maxPages = PdfFileReader(f).numPages

for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages),
                                             fmt='jpg', thread_count=1, userpw=None,
                                             use_cropbox=False, strict=False)
    # save or process `pil_images` here before the next batch replaces it

This way, however large the file, it only converts 100 pages at a time, and RAM usage stays minimal.

Answered by Bot_Start