Extract images from PDF without resampling, in python?

How might one extract all images from a PDF document at their native resolution and format? That is, extract TIFF as TIFF, JPEG as JPEG, and so on, without resampling. Layout is unimportant: I don't care where the source image is located on the page.

I'm using Python 2.7 but can use 3.x if required.

431

asked Apr 22 '10 19:04

matt wilkie

1 Answer

You can use the PyMuPDF module. The snippet below writes every image as a .png file, so the original format is not preserved, but it worked out of the box and is fast.

    import fitz  # PyMuPDF

    doc = fitz.open("file.pdf")
    for i in range(len(doc)):
        for img in doc.getPageImageList(i):
            xref = img[0]  # xref number of the image object
            pix = fitz.Pixmap(doc, xref)
            if pix.n < 5:  # this is GRAY or RGB
                pix.writePNG("p%s-%s.png" % (i, xref))
            else:  # CMYK: convert to RGB first
                pix1 = fitz.Pixmap(fitz.csRGB, pix)
                pix1.writePNG("p%s-%s.png" % (i, xref))
                pix1 = None
            pix = None  # release the pixmap
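Since the question asks for native format rather than PNG, here is a minimal sketch of a variant that writes the stored image bytes directly. It assumes a recent PyMuPDF release, where the calls are named `get_page_images` and `extract_image` (older releases spelled them `getPageImageList` and `extractImage`). The helper names `image_filename` and `extract_native_images` are made up for this example. As far as I know, PyMuPDF still falls back to PNG for images that are not stored in a directly exportable encoding, so treat this as a starting point rather than a guarantee.

```python
import os


def image_filename(page_index, xref, ext):
    """Build an output name such as 'p0-12.jpeg' (same scheme as the snippet above)."""
    return "p%s-%s.%s" % (page_index, xref, ext)


def extract_native_images(pdf_path, out_dir="."):
    """Write each embedded image using the extension PyMuPDF reports for it.

    Returns the list of file paths written. Hypothetical helper, not part
    of PyMuPDF itself.
    """
    import fitz  # PyMuPDF; imported here so image_filename works without it

    written = []
    seen = set()  # the same image xref can be referenced from several pages
    doc = fitz.open(pdf_path)
    try:
        for page_index in range(len(doc)):
            for img in doc.get_page_images(page_index):
                xref = img[0]
                if xref in seen:
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)  # dict with "ext" and "image" keys
                if not info:
                    continue  # xref did not resolve to an extractable image
                path = os.path.join(
                    out_dir, image_filename(page_index, xref, info["ext"])
                )
                with open(path, "wb") as fh:
                    fh.write(info["image"])  # bytes as returned, no Pixmap round trip
                written.append(path)
    finally:
        doc.close()
    return written
```

Calling `extract_native_images("file.pdf")` would then write files such as `p0-12.jpeg` next to the script, one per distinct image, without going through a `Pixmap` conversion.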

See here for more resources.

117

answered Oct 12 '22 00:10

kateryna

Tags: python, image, pdf, extract, pypdf
