
Extract images from PDF without resampling, in Python?

How might one extract all images from a PDF document at native resolution and format? That is, extract TIFF as TIFF, JPEG as JPEG, and so on, without resampling. Layout is unimportant; I don't care where the source image is located on the page.

I'm using Python 2.7 but can use 3.x if required.

matt wilkie asked Apr 22 '10 19:04



1 Answer

You can use the PyMuPDF module. The snippet below writes every image as a .png file, so it does not preserve the original format, but it works out of the box and is fast.

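import fitz  # PyMuPDF

doc = fitz.open("file.pdf")
for i in range(len(doc)):
    # Newer PyMuPDF releases rename this method to get_page_images().
    for img in doc.getPageImageList(i):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:  # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))  # newer releases: pix.save()
        else:  # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None  # release the converted pixmap
        pix = None  # release the original pixmap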
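Since the snippet above re-encodes everything as PNG, here is a rough sketch of how the same loop might instead dump each image in the format the PDF stores it in. It assumes a recent PyMuPDF release, where the methods are named get_page_images() and extract_image(); older releases used camelCase names. The helper names image_filename and extract_native_images and the "p<page>-<xref>.<ext>" naming scheme are made up for illustration.

```python
import os


def image_filename(page_index, xref, ext):
    """Build an output name such as 'p0-12.jpeg' (illustrative naming scheme)."""
    return "p%d-%d.%s" % (page_index, xref, ext)


def extract_native_images(pdf_path, out_dir="."):
    """Write each embedded image to out_dir as stored; return the written paths."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    written = []
    seen = set()  # an image reused on several pages keeps the same xref
    doc = fitz.open(pdf_path)
    for page_index in range(len(doc)):
        for img in doc.get_page_images(page_index):
            xref = img[0]  # first tuple item is the image's xref number
            if xref in seen:
                continue
            seen.add(xref)
            info = doc.extract_image(xref)  # dict holding raw bytes and extension
            name = image_filename(page_index, xref, info["ext"])
            path = os.path.join(out_dir, name)
            with open(path, "wb") as fh:
                fh.write(info["image"])  # bytes are written without re-rendering
            written.append(path)
    doc.close()
    return written
```

Calling extract_native_images("file.pdf") should leave JPEG images as .jpeg files and so on. One caveat: many PDFs store images as plain compressed pixel data rather than in a standalone file format, and for those PyMuPDF has to pick a container (typically PNG), so "native format" only applies where the PDF actually embeds one.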

see here for more resources

kateryna answered Oct 12 '22 00:10