Extracting images from PDF using pypdfium2 (Python)

Question

I am trying to extract images from a PDF document using this specific library: pypdfium2 (https://pypi.org/project/pypdfium2/).

I would love to use PyMuPDF instead (given its excellent speed and versatility), but because it uses a copyleft license I CANNOT use it for my workflow. So please don't provide an answer that advises me to use PyMuPDF.

Any suggestions are appreciated. I've looked through the docs but can't seem to find any image extraction methods.

To be clear, I am not trying to convert the PDF pages into images; I am trying to extract the images embedded in the document itself (assuming there are any). Images are typically embedded as either JPEGs or PNGs.

mara004 · Accepted Answer

pypdfium2 maintainer here. Yes, this is possible, and also documented. Take a look at PdfPage.get_objects() and PdfImage.extract() (or PdfImage.get_bitmap()).

There's also a built-in CLI command, pypdfium2 extract-images, intended as a testing utility. Its implementation demonstrates how to use the above APIs.

However, due to limitations in pdfium's public interface, pypdfium2 is nowhere near as good at image extraction as would be technically possible. You may want to consider pikepdf (MPL2-licensed); IMHO it's the most sophisticated tool for this task.

(BTW, it's better to ask such questions on pypdfium2's discussions page on GitHub, where you're more likely to get a response.)

Tags:

python

pdf

image-extraction

americanthinker

1 Answer

mara004