I am trying to extract images from a PDF document using this specific library: pypdfium2 (https://pypi.org/project/pypdfium2/).
I would love to use PyMuPDF instead (given its excellent speed and versatility), but because it uses a copyleft license I CANNOT use it for my workflow. So please don't provide an answer that advises me to use PyMuPDF.
Any suggestions are appreciated. I've looked through the docs but can't seem to find any image extraction methods.
To be clear, I am not trying to convert the PDF pages into images; I am trying to extract the images embedded in the document itself (assuming there are any). Images are typically embedded as either JPEGs or PNGs.
pypdfium2 maintainer here. Yes, this is possible, and also documented.
Take a look at PdfPage.get_objects() and PdfImage.extract() (or PdfImage.get_bitmap()).
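As a rough illustration of how these calls fit together, here is a minimal sketch. The function name extract_images, the output file naming, and the max_depth value are my own choices for illustration, not part of pypdfium2. I'm assuming a pypdfium2 4.x-style API where PdfImage.extract() accepts a file path prefix and picks the extension itself; check the documentation of your installed version before relying on it.

```python
from pathlib import Path


def extract_images(pdf_path, out_dir):
    """Write the embedded images of a PDF into out_dir and return how many were handled."""
    # Imported here so the third-party dependency is only required when the function runs.
    import pypdfium2 as pdfium

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    pdf = pdfium.PdfDocument(pdf_path)
    count = 0
    try:
        for page_index in range(len(pdf)):
            page = pdf[page_index]
            # Only ask for image objects; max_depth limits recursion into Form XObjects
            # (15 is an arbitrary illustrative value).
            images = page.get_objects(
                filter=[pdfium.raw.FPDF_PAGEOBJ_IMAGE],
                max_depth=15,
            )
            for image_index, image in enumerate(images, start=1):
                # Assumption: extract() takes a path prefix and appends the extension.
                prefix = out_dir / f"page{page_index + 1}_img{image_index}"
                image.extract(str(prefix))
                count += 1
    finally:
        pdf.close()
    return count
```

Calling extract_images("input.pdf", "out") would then leave files such as out/page1_img1.<ext> behind, with the extension depending on how each image is stored. If you would rather work with pixel data, PdfImage.get_bitmap() is the alternative mentioned above.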
There's also a built-in CLI command, pypdfium2 extract-images, as a testing utility. Its implementation demonstrates how to use the above APIs.
However, due to limitations in pdfium's public interface, pypdfium2 is not nearly as good at image extraction as would be technically possible.
You may want to consider pikepdf (MPL2-licensed); it's the most sophisticated tool for this task IMHO.
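For comparison, a minimal sketch of the same task with pikepdf might look like the following. The function name extract_images_pikepdf and the file naming are made up for illustration. Note that, as far as I know, page.images only covers images referenced directly from the page's resources, so images nested inside Form XObjects would need extra handling, and some image encodings may not be extractable.

```python
from pathlib import Path


def extract_images_pikepdf(pdf_path, out_dir):
    """Extract images referenced directly by each page; return the written file paths."""
    # Imported here so the third-party dependency is only required when the function runs.
    import pikepdf

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    with pikepdf.Pdf.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            # page.images maps resource names such as "/Im0" to raw image objects.
            for name, raw_image in page.images.items():
                image = pikepdf.PdfImage(raw_image)
                prefix = out_dir / f"page{page_number}_{str(name).lstrip('/')}"
                # extract_to() chooses the file extension and returns the full path.
                written.append(image.extract_to(fileprefix=str(prefix)))
    return written
```

The returned list makes it easy to see which formats pikepdf ended up writing for each image.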
(BTW, it's better to ask such questions on pypdfium2's discussions page on GitHub, where you're more likely to get a response.)