Extract images from PDF using python PyPDF2

Is there any way to extract images as streams from a PDF document using the PyPDF2 library? Also, is it possible to replace some images with others (generated with PIL, for example, or loaded from a file)?

I'm able to get an EncodedStreamObject from the PDF object tree and read its stream data by calling the getData() method, but the result looks like raw content without any image headers or other meta information.

>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')

I've looked across PyPDF2, ReportLab and PDFMiner solutions quite a bit, but haven't found anything like what I'm looking for.

Any code samples and links will be very helpful.

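import fitz  # PyMuPDF

file_path = "sample.pdf"  # path to the input PDF
doc = fitz.open(file_path)
for i in range(len(doc)):
    # Recent PyMuPDF releases name this get_page_images();
    # older releases used getPageImageList().
    for img in doc.get_page_images(i):
        xref = img[0]  # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # GRAY or RGB pixmaps can be written as PNG directly;
        # save() was called writePNG() in older releases.
        pix.save("p%s-%s.png" % (i, xref))
        pix = None  # release the pixmap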

Image metadata is not stored within the encoded images of a PDF. If metadata is stored at all, it is stored in the PDF itself and stripped from the underlying image. The metadata you see in your example is likely all that you'll be able to get. PDF encoders may store image metadata elsewhere in the PDF, but I haven't seen this. (Note that this metadata question was also asked for Java.)

It's definitely possible to extract the stream, however: as you mentioned, you call the getData() method.
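To illustrate why the extracted data has no image headers: it is only the decoded pixel samples, and a container has to be added by whoever saves them. The sketch below wraps such samples in a minimal PNG using only the standard library. It assumes 8 bits per component and three components per pixel (the question's 9000 bytes for a 100 x 30 image suggest this, but the ICCBased color space is not inspected here). `raw_samples_to_png` and `_png_chunk` are hypothetical helper names, not part of PyPDF2.

```python
import struct
import zlib


def _png_chunk(tag: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def raw_samples_to_png(samples: bytes, width: int, height: int,
                       components: int = 3) -> bytes:
    """Wrap decoded 8-bit samples (1 = gray, 3 = RGB) in a PNG container."""
    color_type = {1: 0, 3: 2}[components]  # PNG color types: 0 gray, 2 truecolor
    stride = width * components
    if len(samples) != stride * height:
        raise ValueError("sample count does not match width/height/components")
    # PNG expects a filter-type byte (0 = none) in front of every scanline.
    rows = b"".join(
        b"\x00" + samples[y * stride:(y + 1) * stride] for y in range(height)
    )
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (b"\x89PNG\r\n\x1a\n"
            + _png_chunk(b"IHDR", ihdr)
            + _png_chunk(b"IDAT", zlib.compress(rows))
            + _png_chunk(b"IEND", b""))


# Stand-in for the question's getData() result: 100 x 30 pixels, 3 bytes each.
samples = b"\xcc" * 9000
png_bytes = raw_samples_to_png(samples, 100, 30)
```

Writing `png_bytes` to a file opened in binary mode should give a viewable image. The ICC profile is ignored in this sketch, so colors may differ slightly from the original.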

As for replacing it, you'll need to create a new image object within the PDF, add it to the end of the file, and update the indirect object references accordingly. It will be difficult to do this with PyPDF2.
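As a rough sketch of the first step only (building the new image object), the code below compresses raw RGB samples and serializes them as a PDF stream object in plain Python. It does not use PyPDF2, does not append the object to a file, and does not update the cross-reference table or the references to the old image. The helper names, the object number 21, and the choice of /DeviceRGB are assumptions made for illustration.

```python
import zlib


def build_image_xobject(samples: bytes, width: int, height: int):
    """Return (dictionary entries, compressed data) for an 8-bit RGB image."""
    if len(samples) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes of RGB samples")
    data = zlib.compress(samples)  # matches the /FlateDecode filter below
    entries = {
        "/Type": "/XObject",
        "/Subtype": "/Image",
        "/Width": width,
        "/Height": height,
        "/ColorSpace": "/DeviceRGB",  # assumption: plain RGB, no ICC profile
        "/BitsPerComponent": 8,
        "/Filter": "/FlateDecode",
        "/Length": len(data),
    }
    return entries, data


def serialize_object(obj_num: int, entries: dict, data: bytes) -> bytes:
    """Serialize the dictionary and stream as an indirect PDF object."""
    body = " ".join("%s %s" % (key, value) for key, value in entries.items())
    head = ("%d 0 obj\n<< %s >>\nstream\n" % (obj_num, body)).encode("ascii")
    return head + data + b"\nendstream\nendobj\n"


# A 100 x 30 solid gray replacement image; 21 is a hypothetical free object number.
new_samples = b"\x80" * (100 * 30 * 3)
entries, data = build_image_xobject(new_samples, 100, 30)
pdf_object = serialize_object(21, entries, data)
```

The samples could equally come from an image generated with PIL, as long as they are 8-bit RGB bytes of the stated size.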

Tags: python, image-processing, pdf, pypdf, reportlab

Asked by: Max Kamenkov

2 Answers: jainam shah, speedplane
