I'm running a simple PDF-to-image conversion using the Python pdf2image library. I can understand that this library crosses some maximum memory threshold to arrive at this error. But the PDF is only about 6.6 MB, so why would it take up GBs of memory and raise a MemoryError?
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdf2image import convert_from_path
>>> pages = convert_from_path(r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf', 200)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 1215, in _readerthread
    buffer.append(fh.read())
MemoryError
Also, what is a possible solution to this?
Update: When I reduced the dpi parameter in the convert_from_path call, it works like a charm, but the pictures produced are low quality (for obvious reasons). Is there a way to fix this, such as creating the images batch by batch and clearing memory each time? If there is, how do I go about it?
Convert the PDF in blocks of 10 pages each time (1-10, 11-20, and so on ...):
from pdf2image import pdfinfo_from_path, convert_from_path

info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)
maxPages = info["Pages"]
for page in range(1, maxPages + 1, 10):
    # Only 10 pages are converted and held in memory per iteration
    convert_from_path(pdf_file, dpi=200, first_page=page, last_page=min(page + 10 - 1, maxPages))
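To keep the converted images from accumulating, each batch can be written to disk and released before the next batch is converted. Below is a minimal sketch of that idea; the output directory name and the JPEG format are assumptions for illustration, not part of the original answer.
import os
from pdf2image import pdfinfo_from_path, convert_from_path

pdf_file = r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf'
out_dir = 'converted_pages'  # assumed output directory for illustration
os.makedirs(out_dir, exist_ok=True)

maxPages = pdfinfo_from_path(pdf_file)["Pages"]
for page in range(1, maxPages + 1, 10):
    batch = convert_from_path(pdf_file, dpi=200,
                              first_page=page,
                              last_page=min(page + 9, maxPages))
    for offset, image in enumerate(batch):
        image.save(os.path.join(out_dir, 'page_%03d.jpg' % (page + offset)), 'JPEG')
    # 'batch' is reassigned on the next iteration, so the previous
    # 10 PIL images become eligible for garbage collection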
I am a bit late to this, but the problem is indeed related to the 136 pages being loaded into memory. You can do three things.
First, by default pdf2image uses PPM as its image format. It is faster, but it also takes a lot more memory (over 30 MB per image!). What you can do to fix this is use a more memory-friendly format like jpeg or png:
convert_from_path(r'C:\path\to\your\pdf', fmt='jpeg')
That will probably solve the problem, but mostly just because of the compression; at some point (say for a 500+ page PDF) the problem will reappear.
Second, use an output folder. This is the option I would recommend, because it allows you to process any PDF. The example on the README page explains it well:
import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)
This writes the images to your computer's storage temporarily, and the folder is cleaned up automatically so you don't have to delete the files manually. Make sure to do any processing you need to do before exiting the with context, though!
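For instance, here is a minimal sketch that does that processing inside the with block, simply saving each page out as a JPEG before the temporary directory is removed (the destination folder name is an assumption for illustration):
import os
import tempfile
from pdf2image import convert_from_path

out_dir = 'converted_pages'  # assumed destination folder
os.makedirs(out_dir, exist_ok=True)

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)
    # All processing has to happen here, while the temporary files still exist
    for i, image in enumerate(images_from_path, start=1):
        image.save(os.path.join(out_dir, 'page_%03d.jpg' % i), 'JPEG')
# The temporary directory and the files inside it are deleted here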
Third, pdf2image allows you to define the first and last page that you want to process. That means that in your case, with a PDF of 136 pages, you could do:
for i in range(0, 136 // 10 + 1):
    convert_from_path(r'C:\path\to\your\pdf', first_page=i*10 + 1, last_page=min((i+1)*10, 136))
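The three options also combine naturally: converting in chunks, as JPEG, into an on-disk output folder keeps only one chunk's worth of images around at a time. A rough sketch under those assumptions (the chunk size and folder handling here are illustrative, not prescribed by the answer):
import tempfile
from pdf2image import pdfinfo_from_path, convert_from_path

pdf_path = r'C:\path\to\your\pdf'
n_pages = pdfinfo_from_path(pdf_path)["Pages"]

with tempfile.TemporaryDirectory() as tmp:
    for first in range(1, n_pages + 1, 10):
        images = convert_from_path(pdf_path, dpi=200, fmt='jpeg',
                                    output_folder=tmp,
                                    first_page=first,
                                    last_page=min(first + 9, n_pages))
        # Process or save this chunk's images here, before the
        # temporary directory disappears at the end of the with block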
The accepted answer has a small issue: maxPages = pdf2image._page_count(pdf_file) can no longer be used, as _page_count is deprecated. I found a working solution for the same.
import pdf2image
from PyPDF2 import PdfFileReader

inputpdf = PdfFileReader(open(pdf, "rb"))
maxPages = inputpdf.numPages
for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages),
                                             fmt='jpg', thread_count=1, userpw=None,
                                             use_cropbox=False, strict=False)
    # Process or save pil_images here; they are released on the next iteration
This way, however large the file, it processes 100 pages at a time and RAM usage always stays minimal.
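Note that PdfFileReader and numPages are themselves deprecated in PyPDF2 3.x in favor of PdfReader and len(reader.pages), so a page-count step that should hold up on current releases looks roughly like the sketch below (or you can avoid PyPDF2 entirely by using pdfinfo_from_path as in the accepted answer):
from PyPDF2 import PdfReader  # PyPDF2 >= 3.0; earlier versions use PdfFileReader

with open(pdf, "rb") as f:
    maxPages = len(PdfReader(f).pages)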