Remove some images and text objects from pdf

Tags: python, pdf, pypdf2

I have a pdf page object with an image and a lot of text.

I want to remove that image, and also remove some text objects based on their contents. That is, I want to get the contents of all text objects and then remove those that satisfy a condition.

How can I do that with PyPDF2? Or is there another library which allows doing that?

sshilovsky asked Sep 20 '13 09:09


1 Answer

To remove all images from a PDF file using PyPDF2 you can do:

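from PyPDF2 import PdfFileReader, PdfFileWriter

# These camelCase names are the PyPDF2 1.x API.
with open("src.pdf", "rb") as input_stream:
    src = PdfFileReader(input_stream)
    output = PdfFileWriter()

    # Copy every page of the source document into the writer.
    for i in range(src.getNumPages()):
        output.addPage(src.getPage(i))

    # Strip the images from all pages added so far.
    output.removeImages()

    # Write while the source file is still open, since pages are read from it.
    with open("dst.pdf", "wb") as output_stream:
        output.write(output_stream)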
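
The snippet above only covers images. For the text part of the question, here is a minimal sketch of the filtering step alone, not a complete PyPDF2 solution. It assumes the page's content stream has already been parsed into a list of (operands, operator) tuples with the operator as bytes, which is the shape PyPDF2 1.x uses for ContentStream.operations. The names drop_text_ops and should_remove are hypothetical, and the sample operations are made up for illustration.

```python
def _as_str(value):
    # Byte strings are decoded as latin-1 purely for matching purposes.
    return value.decode("latin-1") if isinstance(value, bytes) else str(value)


def drop_text_ops(operations, should_remove):
    """Return the operations minus text-showing ones whose text matches.

    operations: list of (operands, operator) tuples (assumed shape).
    should_remove: hypothetical predicate that receives the shown text as str.
    """
    kept = []
    for operands, operator in operations:
        if operator in (b"Tj", b"'", b'"'):
            # For these operators the string to show is the last operand.
            text = _as_str(operands[-1])
        elif operator == b"TJ":
            # TJ takes one array mixing strings and numeric spacing offsets.
            text = "".join(
                _as_str(item) for item in operands[0]
                if isinstance(item, (str, bytes))
            )
        else:
            # Not a text-showing operator: always keep it.
            kept.append((operands, operator))
            continue
        if not should_remove(text):
            kept.append((operands, operator))
    return kept


# Made-up sample data standing in for a parsed content stream.
ops = [
    ([], b"BT"),
    (["Hello"], b"Tj"),
    ([["CONFI", -20, "DENTIAL"]], b"TJ"),
    ([], b"ET"),
]
result = drop_text_ops(ops, lambda text: "CONFIDENTIAL" in text)
```

Two caveats apply to this sketch. The strings in a content stream are encoded through the font, so they do not always match the readable text. Also, the ' and " operators move to the next line before showing text, so dropping them can shift whatever follows.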

R.. answered Nov 11 '22 03:11