I'm running a simple PDF-to-image conversion using the Python pdf2image library. I can understand that this library crosses some maximum memory threshold to arrive at this error. But the PDF is only about 6.6 MB, so why would it take up GBs of memory and raise a MemoryError?
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdf2image import convert_from_path
>>> pages = convert_from_path(r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf', 200)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\aakashba598\AppData\Local\Programs\Python\Python37-32\lib\subprocess.py", line 1215, in _readerthread
    buffer.append(fh.read())
MemoryError
Also, what is a possible solution to this?
Update: When I reduced the dpi parameter in the convert_from_path call, it works like a charm, but the pictures produced are low quality (for obvious reasons). Is there a way to fix this, such as creating the images batch by batch and clearing memory each time? If there is, how do I go about it?
Convert the PDF in blocks of 10 pages each time (1-10, 11-20, and so on ...):
from pdf2image import pdfinfo_from_path, convert_from_path

info = pdfinfo_from_path(pdf_file, userpw=None, poppler_path=None)
maxPages = info["Pages"]
for page in range(1, maxPages + 1, 10):
    # Only 10 pages are converted and held in memory per iteration
    convert_from_path(pdf_file, dpi=200, first_page=page, last_page=min(page + 10 - 1, maxPages))
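To keep the converted images from accumulating, each batch can be written to disk and released before the next batch is converted. Below is a minimal sketch of that idea; the output directory name and the JPEG format are assumptions for illustration, not part of the original answer.
import os
from pdf2image import pdfinfo_from_path, convert_from_path

pdf_file = r'C:\Users\aakashba598\Documents\pwc-annual-report-2017-2018.pdf'
out_dir = 'converted_pages'  # assumed output directory for illustration
os.makedirs(out_dir, exist_ok=True)

maxPages = pdfinfo_from_path(pdf_file)["Pages"]
for page in range(1, maxPages + 1, 10):
    batch = convert_from_path(pdf_file, dpi=200,
                              first_page=page,
                              last_page=min(page + 9, maxPages))
    for offset, image in enumerate(batch):
        image.save(os.path.join(out_dir, 'page_%03d.jpg' % (page + offset)), 'JPEG')
    # 'batch' is reassigned on the next iteration, so the previous
    # 10 PIL images become eligible for garbage collection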
I am a bit late to this, but the problem is indeed related to the 136 pages being loaded into memory. You can do three things.
First, by default pdf2image uses PPM as its image format. It is faster, but it also takes a lot more memory (over 30 MB per image!). What you can do to fix this is use a more memory-friendly format like jpeg or png:
convert_from_path(r'C:\path\to\your\pdf', fmt='jpeg')
That will probably solve the problem, but mostly just because of the compression; at some point (say for a 500+ page PDF) the problem will reappear.
Second, use an output folder. This is the option I would recommend, because it allows you to process any PDF. The example on the README page explains it well:
import tempfile
from pdf2image import convert_from_path

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)
This writes the images to your computer's storage temporarily, and the folder is cleaned up automatically so you don't have to delete the files manually. Make sure to do any processing you need to do before exiting the with context, though!
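For instance, here is a minimal sketch that does that processing inside the with block, simply saving each page out as a JPEG before the temporary directory is removed (the destination folder name is an assumption for illustration):
import os
import tempfile
from pdf2image import convert_from_path

out_dir = 'converted_pages'  # assumed destination folder
os.makedirs(out_dir, exist_ok=True)

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path(r'C:\path\to\your\pdf', output_folder=path)
    # All processing has to happen here, while the temporary files still exist
    for i, image in enumerate(images_from_path, start=1):
        image.save(os.path.join(out_dir, 'page_%03d.jpg' % i), 'JPEG')
# The temporary directory and the files inside it are deleted here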
Third, pdf2image allows you to define the first and last page that you want to process. That means that in your case, with a PDF of 136 pages, you could do:
for i in range(0, 136 // 10 + 1):
    convert_from_path(r'C:\path\to\your\pdf', first_page=i*10 + 1, last_page=min((i+1)*10, 136))
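The three options also combine naturally: converting in chunks, as JPEG, into an on-disk output folder keeps only one chunk's worth of images around at a time. A rough sketch under those assumptions (the chunk size and folder handling here are illustrative, not prescribed by the answer):
import tempfile
from pdf2image import pdfinfo_from_path, convert_from_path

pdf_path = r'C:\path\to\your\pdf'
n_pages = pdfinfo_from_path(pdf_path)["Pages"]

with tempfile.TemporaryDirectory() as tmp:
    for first in range(1, n_pages + 1, 10):
        images = convert_from_path(pdf_path, dpi=200, fmt='jpeg',
                                    output_folder=tmp,
                                    first_page=first,
                                    last_page=min(first + 9, n_pages))
        # Process or save this chunk's images here, before the
        # temporary directory disappears at the end of the with block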
The accepted answer has a small issue: maxPages = pdf2image._page_count(pdf_file) can no longer be used, as _page_count is deprecated. I found a working solution for the same.
import pdf2image
from PyPDF2 import PdfFileReader

inputpdf = PdfFileReader(open(pdf, "rb"))
maxPages = inputpdf.numPages
for page in range(1, maxPages + 1, 100):
    pil_images = pdf2image.convert_from_path(pdf, dpi=200, first_page=page,
                                             last_page=min(page + 100 - 1, maxPages),
                                             fmt='jpg', thread_count=1, userpw=None,
                                             use_cropbox=False, strict=False)
    # Process or save pil_images here; they are released on the next iteration
This way, however large the file, it processes 100 pages at a time and RAM usage always stays minimal.
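Note that PdfFileReader and numPages are themselves deprecated in PyPDF2 3.x in favor of PdfReader and len(reader.pages), so a page-count step that should hold up on current releases looks roughly like the sketch below (or you can avoid PyPDF2 entirely by using pdfinfo_from_path as in the accepted answer):
from PyPDF2 import PdfReader  # PyPDF2 >= 3.0; earlier versions use PdfFileReader

with open(pdf, "rb") as f:
    maxPages = len(PdfReader(f).pages)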